Thomasleeper R Tutorials

Analysis of Variance
Omnibus test
Post-hoc tests
Treatment group summaries
Treatment group plots

Basic Math Tutorial

Basic Univariate Statistics
Simple statistics
Summary

Binary Outcome GLM Effects Plots
Predicted Probability Plots
Marginal Effects Plots

Binary Outcome GLM Plots
Predicted probabilities for the logit model
Log-odds predictions for logit models

Bivariate Regression
Regression on a binary covariate
Regression on a continuous covariate
Regression on a discrete covariate
Discrete covariate as factor

Character String Manipulation
paste
strsplit
nchar and substring

Colors for Plotting
Color Vector Recycling
Color generation functions
Color as data

Comments
Hash comments
Ignoring Code blocks
R comment function

Correlation and partial correlation
Correlation
Correlations of non-linear relationships
Partial correlations

Count Regression Models

Data Archiving
The Dataverse Network
The dvn package
Searching for data using dvn

Dataframe rearrangement
Column order
Row order
Subset of rows
Splitting a dataframe
Sampling and permutations

Dataframe Structure
print, summary, and str
head and tail
edit and fix

Dataframes
Dataframe indexing
Modifying dataframes
Combining dataframes

Exporting results to Word
Base R functions
The rtf Package

Factors
Converting from factor class
Modifying factors

Heteroskedasticity-Consistent SEs for OLS

Interaction Plots
Three-Dimensional Interaction Plotting

Lists
Positional indexing of lists
Named indexing of lists
Modifying list elements
Converting a list to a vector (and back)

Loading Data
General Notes
Built-in Data
Manual data entry
Loading tabular data
Reading .RData data
Loading “Foreign” data
Reading Excel files
Notes on other data situations

Local Regression (LOESS)
Fitting and visualizing local regressions

Logicals
Set membership
Vectorization

Matrices
Matrix indexing
Diagonal and triangles
Matrix names

Matrix algebra
Scalar addition/subtraction
Scalar multiplication/division
Matrix comparators, logicals, and assignment
Matrix Multiplication
Cross-product
Row/column means and sums

Missing data

Missing data handling
Local NA handling
Regression NA handling
Global NA handling

Model Formulae
Formula basics
Interaction terms
Regression formulae
Factor variables
As-is variables
Formulae as character strings
Advanced formula manipulation

Multinomial Outcome Models
Predicted values from multinomial models

Multiple imputation
Amelia
mi
mice
Comparing packages

Multivariate Regression
Regression formulae for multiple covariates
Regression estimation
Extracting coefficients
Regression summaries
Plots of Multivariate OLS Models

Numeric Printing
False Precision
signif and round
digits options
sprintf

OLS as Regression on Means

OLS Goodness of Fit
R-Squared
Standard Error of the Regression
Formal model comparison
Quantile-Quantile (QQ) plots

OLS in matrix form

OLS interaction plots
Plots for identifying interactions
Start with the raw data
Predicted outcomes
Incorrect models (without constituent terms)

Ordered Outcome Models
Estimating ordered logit and probit models
Predicted outcomes for ordered models
Predicted probabilities for ordered models
Alternative predicted probability plot

Permutation Tests
library(coin)

Plots as data summary
Histogram
Density plot
Barplot
Dotchart
Boxplot
Scatterplot

Plotting regression summaries
Plotting regression slopes

Power, Effect Sizes, and Minimum Detectable Effects
Factors influencing power
Power of a t-test
Minimum detectable effect size
calculate power for a one-tailed test and plot:
note how the MDE is larger than the smallest effect that would be considered “significant”:
Power in cluster randomized experiments

Probability distributions
Density functions
Cumulative distribution functions
Quantile function
Other distributions

R object classes
Numeric
Character
Factor
Logical

R Objects and Environment
Listing objects
Viewing individual objects
Object class
str
summary
Structure of other objects

Recoding
Recoding missing values
Recoding based on multiple input variables

Regression-related plotting
Multivariate OLS plotting

Regression coefficient plots
Plotting Standard Errors
Plotting Confidence Intervals
Comparable effect sizes

Regular expressions

Saving R Data
dput (and dget)
dump (and source)
write.csv and write.table
Writing to “foreign” file formats

Scale construction
Simple scaling
Using indexing in building scales

Scatterplot Jittering

Scatterplot with marginal rugs

Standardized linear regression coefficients

Tables
Table margins
Proportions in crosstables

The curve Function

Variables
Variable naming rules

Vector Indexing
Positional indexing
Named indexing
Logical indexing
Blank index

Vectors
Vector classes
Empty vectors



Analysis of Variance

One of the most prominent classical statistical techniques is the Analysis of Variance (ANOVA). ANOVA is an especially important tool in experimental analysis, where it is used as an omnibus test of the null hypothesis that mean outcomes across all groups are equal (or, stated differently, that the outcome variance between groups is no larger than the outcome variance within groups). This tutorial walks through the basics of using ANOVA in R. We'll start with some fake data from an imaginary four-group experiment:

set.seed(100)
tr <- rep(1:4, each = 30)
y <- numeric(length = 120)
y[tr == 1] <- rnorm(30, 5, 1)
y[tr == 2] <- rnorm(30, 4, 2)
y[tr == 3] <- rnorm(30, 4, 5)
y[tr == 4] <- rnorm(30, 1, 2)

Omnibus test

The principal use of ANOVA is to partition the sum of squares from the data and test whether the variance across groups is larger than the variance within groups. The function to do this in R is aov. (Note: This should not be confused with the anova function, which is a model-comparison tool for regression models.) ANOVA models can be expressed as formulae (as in regression, since the techniques are analogous):

aov(y ~ tr)
## Call:
##    aov(formula = y ~ tr)
## 
## Terms:
##                     tr Residuals
## Sum of Squares   251.2    1196.4
## Deg. of Freedom      1       118
## 
## Residual standard error: 3.184
## Estimated effects may be unbalanced

The default output of the aov function is surprisingly uninformative. Note also that the call above treated tr as a numeric variable, which is why it received only 1 degree of freedom; we should instead wrap tr in factor so that each treatment group gets its own level, and use summary to see a more meaningful output:

summary(aov(y ~ factor(tr)))
##              Df Sum Sq Mean Sq F value  Pr(>F)    
## factor(tr)    3    297    99.0    9.98 6.6e-06 ***
## Residuals   116   1151     9.9                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This output is precisely what we would expect. It shows the “within” and “between” sum of squares, the F-statistic, and the p-value associated with that statistic. If significant (which it is in this case), we also see some stars to the right-hand side.
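
As a quick sanity check (an illustration going slightly beyond the text), we can reproduce that F-statistic by hand by partitioning the sums of squares ourselves; the object names here (grp, f_stat, and so on) are just for illustration:

grp <- factor(tr)
grand_mean <- mean(y)
# between-group sum of squares: group sizes times squared deviations of group means
ss_between <- sum(tabulate(grp) * (tapply(y, grp, mean) - grand_mean)^2)
# within-group sum of squares: squared deviations from each group's own mean
ss_within <- sum((y - ave(y, grp))^2)
df_between <- nlevels(grp) - 1
df_within <- length(y) - nlevels(grp)
f_stat <- (ss_between/df_between)/(ss_within/df_within)
f_stat  # should match the F value in the table above (about 9.98)
pf(f_stat, df_between, df_within, lower.tail = FALSE)  # and its p-value (about 6.6e-06)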

Another way to see basically the same output is with the oneway.test function. It conducts a one-way ANOVA (by default without assuming equal variances across groups), whereas aov accommodates more complex experimental designs:

oneway.test(y ~ tr)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  y and tr
## F = 39.38, num df = 3.00, denom df = 54.25, p-value = 1.191e-13

The oneway.test function allows us to control whether equal variances are assumed across groups with the var.equal argument:

oneway.test(y ~ factor(tr), var.equal = TRUE)
## 
##  One-way analysis of means
## 
## data:  y and factor(tr)
## F = 9.983, num df = 3, denom df = 116, p-value = 6.634e-06

I always feel like the F-statistic is a bit of a letdown. It's a lot of calculation reduced to a single number, which really doesn't tell you much. Instead, we need to summarize the data - with a table or figure - in order to see what that F-statistic means in practice.

As a non-parametric alternative to the ANOVA, which invokes a normality assumption about the residuals, one can use the Kruskal-Wallis analysis of variance test. This does not assume normality of residuals, but does assume that the treatment group outcome distributions have identical shape (other than a shift in median). To implement the Kruskal-Wallis ANOVA, we simply use kruskal.test:

kruskal.test(y ~ tr)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  y by tr
## Kruskal-Wallis chi-squared = 36.96, df = 3, p-value = 4.702e-08

The output of this test is somewhat simpler than that from aov, presenting us with the test statistic and associated p-value immediately.
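
Like most test functions in R, kruskal.test returns a list (of class "htest"), so pieces of it can be extracted directly; a small illustration (the object name kw is arbitrary):

kw <- kruskal.test(y ~ tr)
kw$statistic  # the Kruskal-Wallis chi-squared statistic
kw$p.value  # the associated p-value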

For more details on assumptions about distributions, look at the tutorial on variance tests.

Post-hoc tests

Post-hoc comparisons are possible in R. The TukeyHSD function is available in the base stats package, but the multcomp add-on package offers much more. Other options include the psych package and the car package. In all, it's too much to cover in detail here. We'll look at the TukeyHSD function, which estimates Tukey's Honestly Significant Differences (for all pairwise group comparisons in an aov object):

TukeyHSD(aov(y ~ factor(tr)))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = y ~ factor(tr))
## 
## $`factor(tr)`
##        diff    lwr     upr  p adj
## 2-1 -0.8437 -2.963  1.2759 0.7278
## 3-1 -1.2482 -3.368  0.8714 0.4200
## 4-1 -4.1787 -6.298 -2.0590 0.0000
## 3-2 -0.4045 -2.524  1.7151 0.9595
## 4-2 -3.3350 -5.455 -1.2153 0.0004
## 4-3 -2.9304 -5.050 -0.8108 0.0026

One can always fall back on the trusty t-test (implemented with t.test) to compare treatment groups pairwise:

t.test(y[tr %in% 1:2] ~ tr[tr %in% 1:2])
## 
##  Welch Two Sample t-test
## 
## data:  y[tr %in% 1:2] by tr[tr %in% 1:2]
## t = 1.896, df = 34.2, p-value = 0.06646
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.06051  1.74794
## sample estimates:
## mean in group 1 mean in group 2 
##           5.029           4.185
t.test(y[tr %in% c(2, 4)] ~ tr[tr %in% c(2, 4)])
## 
##  Welch Two Sample t-test
## 
## data:  y[tr %in% c(2, 4)] by tr[tr %in% c(2, 4)]
## t = 5.98, df = 56.41, p-value = 1.6e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.218 4.452
## sample estimates:
## mean in group 2 mean in group 4 
##          4.1851          0.8502

But the user should, of course, be aware of the problems associated with multiple comparisons.
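
One way to address that concern (a sketch going beyond the text) is the pairwise.t.test function from the stats package, which performs all pairwise comparisons while adjusting the p-values for multiple testing (the Holm method by default; other methods can be requested via p.adjust.method):

# pairwise t-tests with Holm-adjusted p-values
pairwise.t.test(y, factor(tr))
# or, for example, a Bonferroni correction
pairwise.t.test(y, factor(tr), p.adjust.method = "bonferroni")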

Treatment group summaries

The easiest way to summarize the information underlying an ANOVA procedure is to look at the treatment group means and variances (or standard deviations). Luckily, R makes it very easy to calculate these statistics for each group using the by function. If we want the mean of y for each level of tr, we simply call:

by(y, tr, FUN = mean)
## tr: 1
## [1] 5.029
## -------------------------------------------------------- 
## tr: 2
## [1] 4.185
## -------------------------------------------------------- 
## tr: 3
## [1] 3.781
## -------------------------------------------------------- 
## tr: 4
## [1] 0.8502

The result is an output that shows the treatment level and the associated mean. We can also obtain the same information in a slightly different format using tapply:

tapply(y, tr, FUN = mean)
##      1      2      3      4 
## 5.0289 4.1851 3.7806 0.8502

This returns a named array, which is perhaps easier to work with. We can do the same for the treatment group standard deviations:

tapply(y, tr, FUN = sd)
##     1     2     3     4 
## 0.702 2.334 5.464 1.970
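
Because these results are named by treatment level, we can index them by label. For example (a small aside), the difference between the group 4 and group 1 means reproduces the 4-1 difference reported by TukeyHSD above:

means <- tapply(y, tr, FUN = mean)
means["4"] - means["1"]  # difference in means between groups 4 and 1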

And we could even bind them together:

out <- cbind(tapply(y, tr, FUN = mean), tapply(y, tr, FUN = sd))
colnames(out) <- c("mean", "sd")
out
##     mean    sd
## 1 5.0289 0.702
## 2 4.1851 2.334
## 3 3.7806 5.464
## 4 0.8502 1.970

The result is a nice matrix showing the mean and standard deviation for each group. If there were some other statistic we wanted to calculate for each group, we could easily use by or tapply to obtain it.
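
For instance (a sketch going slightly beyond the text), we could compute each group's mean and the standard error of that mean in one step by splitting y by treatment group:

# columns are treatment groups; rows are the mean and its standard error
sapply(split(y, tr), function(g) c(mean = mean(g), se = sd(g)/sqrt(length(g))))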

Treatment group plots

A perhaps more convenient way to see our data is to plot it. We can use plot to produce a simple scatterplot, and we can use our out matrix to highlight the treatment group means:

plot(y ~ tr, col = rgb(1, 0, 0, 0.5), pch = 16)
# highlight the means:
points(1:4, out[, 1], col = "blue", bg = "blue", pch = 23, cex = 2)

[Plot: scatterplot of y by treatment group, with group means highlighted in blue]

This is nice because it shows the distribution of the data, but we can also use a boxplot to summarize where points fall in each distribution. Specifically, a boxplot draws the five-number summary for each treatment group:

tapply(y, tr, fivenum)
## $`1`
## [1] 3.842 4.562 5.093 5.319 7.310
## 
## $`2`
## [1] -0.5439  2.9554  3.9527  6.1308  7.7949
## 
## $`3`
## [1] -6.3720  0.4349  3.7029  7.1900 16.9098
## 
## $`4`
## [1] -1.9160 -0.5528  0.3361  1.8270  5.8914
boxplot(y ~ tr)

[Plot: boxplots of y by treatment group]

Another approach is to use our out object, containing treatment group means and standard deviations, to draw a dotchart. We'll first divide our standard deviations by sqrt(30) to convert them to standard errors of the mean.

out[, 2] <- out[, 2]/sqrt(30)
dotchart(out[, 1], xlim = c(0, 6), xlab = "y", main = "Treatment group means", 
    pch = 23, bg = "black")
segments(out[, 1] - out[, 2], 1:4, out[, 1] + out[, 2], 1:4, lwd = 2)
segments(out[, 1] - 2 * out[, 2], 1:4, out[, 1] + 2 * out[, 2], 1:4, lwd = 1)

[Plot: dotchart of treatment group means with 1- and 2-standard-error bars]

This plot nicely shows the means and both 1- and 2-standard errors of the mean.

Basic Math Tutorial

R enables you to do basic math, using all the usual operators:

Addition

2 + 2
## [1] 4
1 + 2 + 3 + 4 + 5
## [1] 15

Subtraction

10 - 1
## [1] 9
5 - 6
## [1] -1

Multiplication

2 * 2
## [1] 4
1 * 2 * 3
## [1] 6

Division

4/2
## [1] 2
10/2/4
## [1] 1.25

Parentheses can be used to adjust order of operations:

10/2 + 2
## [1] 7
10/(2 + 2)
## [1] 2.5
(10/2) + 2
## [1] 7

Check your intuition about the order of operations by trying expressions of your own.
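
For example, multiplication binds more tightly than subtraction:

10 - 2 * 3
## [1] 4
(10 - 2) * 3
## [1] 24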

Exponents and square roots involve intuitive syntax:

2^2
## [1] 4
3^4
## [1] 81
1^0
## [1] 1
sqrt(4)
## [1] 2

So do logarithms:

log(0)
## [1] -Inf
log(1)
## [1] 0

and logarithms to other bases, including arbitrary ones:

log10(1)
## [1] 0
log10(10)
## [1] 1
log2(1)
## [1] 0
log2(2)
## [1] 1
logb(1, base = 5)
## [1] 0
logb(5, base = 5)
## [1] 1
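
The log function itself also accepts a base argument, which is equivalent to logb (a small aside):

log(100, base = 10)
## [1] 2
log(8, base = 2)
## [1] 3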

The natural exponential function uses a similar syntax:

exp(0)
## [1] 1
exp(1)
## [1] 2.718
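
Because exp and log are inverses of one another, composing them returns the original value:

exp(log(10))
## [1] 10
log(exp(3))
## [1] 3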

There are also tons of other mathematical operations, like:

Absolute value

abs(-10)
## [1] 10

Factorials:

factorial(10)
## [1] 3628800

and Choose:

choose(4, 1)
## [1] 4
choose(6, 3)
## [1] 20
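
choose(n, k) is the binomial coefficient, which can equivalently be written in terms of factorials:

factorial(6)/(factorial(3) * factorial(6 - 3))
## [1] 20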

Basic Univariate Statistics

R is obviously a statistical programming language and environment, so we can use it to do statistics. With any vector, we can calculate a number of statistics, including:

set.seed(1)
a <- rnorm(100)

Simple statistics

Minimum

min(a)
## [1] -2.215

Maximum

max(a)
## [1] 2.402

We can get the minimum and maximum together with range:

range(a)
## [1] -2.215  2.402

We can also obtain the minimum by sorting the vector (using sort):

sort(a)[1]
## [1] -2.215

And we can obtain the maximum by sorting in the opposite order:

sort(a, decreasing = TRUE)[1]
## [1] 2.402

To calculate the central tendency, we have several options.

Mean

mean(a)
## [1] 0.1089

This is of course equivalent to:

sum(a)/length(a)
## [1] 0.1089

Median

median(a)
## [1] 0.1139

In a vector with an even number of elements, this is equivalent to:

(sort(a)[length(a)/2] + sort(a)[length(a)/2 + 1])/2
## [1] 0.1139

In a vector with an odd number of elements, this is equivalent to:

a2 <- a[-1]  #' drop first observation of `a`
sort(a2)[length(a2)/2 + 1]
## [1] 0.1533

We can also obtain measures of dispersion:

Variance

var(a)
## [1] 0.8068

This is equivalent to:

sum((a - mean(a))^2)/(length(a) - 1)
## [1] 0.8068

Standard deviation

sd(a)
## [1] 0.8982

Which is equivalent to:

sqrt(var(a))
## [1] 0.8982

Or:

sqrt(sum((a - mean(a))^2)/(length(a) - 1))
## [1] 0.8982

There are also some convenience functions that provide multiple statistics. The fivenum function provides the five-number summary (minimum, Q1, median, Q3, and maximum):

fivenum(a)
## [1] -2.2147 -0.5103  0.1139  0.6934  2.4016

It is also possible to obtain arbitrary percentiles/quantiles from a vector:

quantile(a, 0.1)  #' 10% quantile
##    10% 
## -1.053

You can also specify a vector of quantiles:

quantile(a, c(0.025, 0.975))
##   2.5%  97.5% 
## -1.671  1.797
quantile(a, seq(0, 1, by = 0.1))
##      0%     10%     20%     30%     40%     50%     60%     70%     80% 
## -2.2147 -1.0527 -0.6139 -0.3753 -0.0767  0.1139  0.3771  0.5812  0.7713 
##     90%    100% 
##  1.1811  2.4016

Summary

The summary function, applied to a numeric vector, provides the minimum, maximum, and quartiles along with the mean:

summary(a)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.210  -0.494   0.114   0.109   0.692   2.400

Note: The summary function returns different results if the vector is a logical, character, or factor. For a logical vector, summary returns some tabulations:

summary(as.logical(rbinom(100, 1, 0.5)))
##    Mode   FALSE    TRUE    NA's 
## logical      62      38       0

For a character vector, summary returns just some basic information about the vector:

summary(sample(c("a", "b", "c"), 100, TRUE))
##    Length     Class      Mode 
##       100 character character
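
If we instead want counts of each unique value in a character vector, table is more informative (a brief aside):

table(sample(c("a", "b", "c"), 100, TRUE))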

For a factor, summary returns a table of all values in the vector:

summary(factor(a))
##    -2.2146998871775   -1.98935169586337   -1.80495862889104 
##                   1                   1                   1 
##   -1.52356680042976   -1.47075238389927   -1.37705955682861 
##                   1                   1                   1 
##   -1.27659220845804    -1.2536334002391   -1.22461261489836 
##                   1                   1                   1 
##   -1.12936309608079   -1.04413462631653  -0.934097631644252 
##                   1                   1                   1 
##  -0.835628612410047  -0.820468384118015  -0.743273208882405 
##                   1                   1                   1 
##  -0.709946430921815   -0.70749515696212   -0.68875569454952 
##                   1                   1                   1 
##  -0.626453810742332  -0.621240580541804  -0.612026393250771 
##                   1                   1                   1 
##  -0.589520946188072  -0.573265414236886  -0.568668732818502 
##                   1                   1                   1 
##   -0.54252003099165   -0.47815005510862  -0.473400636439312 
##                   1                   1                   1 
##  -0.443291873218433   -0.41499456329968  -0.394289953710349 
##                   1                   1                   1 
##  -0.367221476466509  -0.305388387156356  -0.304183923634301 
##                   1                   1                   1 
##  -0.253361680136508  -0.164523596253587  -0.155795506705329 
##                   1                   1                   1 
##  -0.135178615123832  -0.135054603880824  -0.112346212150228 
##                   1                   1                   1 
##  -0.102787727342996 -0.0593133967111857 -0.0561287395290008 
##                   1                   1                   1 
## -0.0538050405829051 -0.0449336090152309 -0.0392400027331692 
##                   1                   1                   1 
## -0.0161902630989461 0.00110535163162413  0.0280021587806661 
##                   1                   1                   1 
##  0.0743413241516641  0.0745649833651906   0.153253338211898 
##                   1                   1                   1 
##   0.183643324222082   0.188792299514343   0.267098790772231 
##                   1                   1                   1 
##   0.291446235517463   0.329507771815361   0.332950371213518 
##                   1                   1                   1 
##   0.341119691424425    0.36458196213683   0.370018809916288 
##                   1                   1                   1 
##   0.387671611559369   0.389843236411431   0.398105880367068 
##                   1                   1                   1 
##   0.417941560199702   0.475509528899663   0.487429052428485 
##                   1                   1                   1 
##   0.556663198673657   0.558486425565304   0.569719627442413 
##                   1                   1                   1 
##   0.575781351653492   0.593901321217509   0.593946187628422 
##                   1                   1                   1 
##   0.610726353489055    0.61982574789471   0.689739362450777 
##                   1                   1                   1 
##   0.696963375404737   0.700213649514998   0.738324705129217 
##                   1                   1                   1 
##   0.763175748457544   0.768532924515416   0.782136300731067 
##                   1                   1                   1 
##   0.821221195098089   0.881107726454215   0.918977371608218 
##                   1                   1                   1 
##   0.943836210685299    1.06309983727636    1.10002537198388 
##                   1                   1                   1 
##    1.12493091814311    1.16040261569495     1.1780869965732 
##                   1                   1                   1 
##    1.20786780598317    1.35867955152904    1.43302370170104 
##                   1                   1                   1 
##    1.46555486156289    1.51178116845085    1.58683345454085 
##                   1                   1                   1 
##    1.59528080213779    1.98039989850586    2.17261167036215 
##                   1                   1                   1 
##    2.40161776050478 
##                   1

A summary of a dataframe will return the summary information separately for each column. This may produce a different result for each column, depending on the class of the column:

summary(data.frame(a = 1:10, b = 11:20))
##        a               b       
##  Min.   : 1.00   Min.   :11.0  
##  1st Qu.: 3.25   1st Qu.:13.2  
##  Median : 5.50   Median :15.5  
##  Mean   : 5.50   Mean   :15.5  
##  3rd Qu.: 7.75   3rd Qu.:17.8  
##  Max.   :10.00   Max.   :20.0
summary(data.frame(a = 1:10, b = factor(11:20)))
##        a               b    
##  Min.   : 1.00   11     :1  
##  1st Qu.: 3.25   12     :1  
##  Median : 5.50   13     :1  
##  Mean   : 5.50   14     :1  
##  3rd Qu.: 7.75   15     :1  
##  Max.   :10.00   16     :1  
##                  (Other):4

A summary of a list returns relatively little useful information:

summary(list(a = 1:10, b = 1:10))
##   Length Class  Mode   
## a 10     -none- numeric
## b 10     -none- numeric

A summary of a matrix returns a summary of each column separately (like a dataframe):

summary(matrix(1:20, nrow = 4))
##        V1             V2             V3              V4      
##  Min.   :1.00   Min.   :5.00   Min.   : 9.00   Min.   :13.0  
##  1st Qu.:1.75   1st Qu.:5.75   1st Qu.: 9.75   1st Qu.:13.8  
##  Median :2.50   Median :6.50   Median :10.50   Median :14.5  
##  Mean   :2.50   Mean   :6.50   Mean   :10.50   Mean   :14.5  
##  3rd Qu.:3.25   3rd Qu.:7.25   3rd Qu.:11.25   3rd Qu.:15.2  
##  Max.   :4.00   Max.   :8.00   Max.   :12.00   Max.   :16.0  
##        V5      
##  Min.   :17.0  
##  1st Qu.:17.8  
##  Median :18.5  
##  Mean   :18.5  
##  3rd Qu.:19.2  
##  Max.   :20.0

Binary Outcome GLM Effects Plots

This tutorial aims at making various binary outcome GLM models interpretable through the use of plots. As such, it begins by setting up some data (involving a few covariates) and then generates several versions of an outcome based upon data-generating processes with and without an interaction. The aim is both to highlight the use of predicted probability plots for demonstrating effects and to demonstrate the challenge - even then - of clearly communicating the results of these types of models.

Let's begin by generating our covariates:

set.seed(1)
n <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- runif(n, 0, 1)
x3 <- runif(n, 0, 5)

Now, we'll build several models. Each model has an outcome that is a transformed linear function of the covariates (i.e., we calculate a y variable that is a linear function of the covariates, then rescale that outcome to [0,1], and use the rescaled version as the probability in generating draws from a binomial distribution).

# Simple multivariate model (no interaction):
y1 <- 2 * x1 + 5 * x2 + rnorm(n, 0, 3)
y1s <- rbinom(n, 1, (y1 - min(y1))/(max(y1) - min(y1)))  # the math here is just to rescale to [0,1]
# Simple multivariate model (with interaction):
y2 <- 2 * x1 + 5 * x2 + 2 * x1 * x2 + rnorm(n, 0, 3)
y2s <- rbinom(n, 1, (y2 - min(y2))/(max(y2) - min(y2)))
# Simple multivariate model (with interaction and an extra term):
y3 <- 2 * x1 + 5 * x2 + 2 * x1 * x2 + x3 + rnorm(n, 0, 3)
y3s <- rbinom(n, 1, (y3 - min(y3))/(max(y3) - min(y3)))

We thus have three binary outcomes (y1s, y2s, and y3s), each constructed as a slightly different function of our three covariates. We can then build models of each outcome. For the second and third outcomes we'll build two versions of the model (an a version that does not model the interaction and a b version that does):

m1 <- glm(y1s ~ x1 + x2, family = binomial(link = "probit"))
m2a <- glm(y2s ~ x1 + x2, family = binomial(link = "probit"))
m2b <- glm(y2s ~ x1 * x2, family = binomial(link = "probit"))
m3a <- glm(y1s ~ x1 + x2 + x3, family = binomial(link = "probit"))
m3b <- glm(y1s ~ x1 * x2 + x3, family = binomial(link = "probit"))

We can look at the output of one of our models, e.g., m3b (the version estimated with the interaction and the additional covariate x3), but we know that the coefficients are not directly interpretable:

summary(m3b)
## 
## Call:
## glm(formula = y1s ~ x1 * x2 + x3, family = binomial(link = "probit"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.554  -1.141   0.873   1.011   1.365  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.1591     0.2769   -0.57     0.57
## x1            0.4963     0.3496    1.42     0.16
## x2            0.3212     0.4468    0.72     0.47
## x3           -0.0315     0.0625   -0.50     0.61
## x1:x2        -0.0794     0.6429   -0.12     0.90
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 274.83  on 199  degrees of freedom
## Residual deviance: 266.59  on 195  degrees of freedom
## AIC: 276.6
## 
## Number of Fisher Scoring iterations: 4

Instead we need to look at fitted values (specifically, the predicted probability of observing y==1 in each model). We can see these fitted values for our actual data using the predict function:

p3b.fitted <- predict(m3b, type = "response", se.fit = TRUE)
p3b.fitted
## $fit
##      1      2      3      4      5      6      7      8      9     10 
## 0.4298 0.4530 0.6225 0.6029 0.4015 0.6364 0.6609 0.5970 0.6545 0.4695 
##     11     12     13     14     15     16     17     18     19     20 
## 0.4973 0.4273 0.6569 0.5082 0.6642 0.4463 0.6614 0.6982 0.5151 0.6141 
##     21     22     23     24     25     26     27     28     29     30 
## 0.6302 0.4258 0.6220 0.4929 0.5332 0.4765 0.4642 0.3856 0.6203 0.4908 
##     31     32     33     34     35     36     37     38     39     40 
## 0.4228 0.6397 0.4608 0.4837 0.6609 0.6472 0.6508 0.5043 0.6425 0.4370 
##     41     42     43     44     45     46     47     48     49     50 
## 0.6572 0.6667 0.6839 0.6086 0.6539 0.6261 0.4685 0.3956 0.6660 0.7103 
##     51     52     53     54     55     56     57     58     59     60 
## 0.4737 0.6890 0.4736 0.5032 0.4953 0.4100 0.4323 0.6188 0.6302 0.5521 
##     61     62     63     64     65     66     67     68     69     70 
## 0.6694 0.4475 0.4720 0.5097 0.6794 0.3998 0.4009 0.6893 0.4861 0.6063 
##     71     72     73     74     75     76     77     78     79     80 
## 0.4162 0.5845 0.4409 0.4336 0.4674 0.6353 0.6039 0.4327 0.6459 0.6608 
##     81     82     83     84     85     86     87     88     89     90 
## 0.3937 0.6701 0.4960 0.4253 0.6011 0.4232 0.6499 0.4563 0.3995 0.4565 
##     91     92     93     94     95     96     97     98     99    100 
## 0.4641 0.3956 0.7014 0.6519 0.6543 0.6663 0.4225 0.4199 0.5856 0.7099 
##    101    102    103    104    105    106    107    108    109    110 
## 0.6602 0.4064 0.4584 0.6347 0.6382 0.5027 0.4342 0.4874 0.5928 0.6369 
##    111    112    113    114    115    116    117    118    119    120 
## 0.5924 0.6219 0.5437 0.4831 0.4262 0.4882 0.6009 0.4480 0.5396 0.6509 
##    121    122    123    124    125    126    127    128    129    130 
## 0.6434 0.4324 0.4667 0.5559 0.6854 0.5004 0.6580 0.4891 0.4115 0.6425 
##    131    132    133    134    135    136    137    138    139    140 
## 0.6845 0.4723 0.4666 0.6879 0.6321 0.6621 0.6502 0.6263 0.6739 0.6822 
##    141    142    143    144    145    146    147    148    149    150 
## 0.6963 0.5908 0.4650 0.4652 0.6807 0.4638 0.4344 0.6252 0.4487 0.6544 
##    151    152    153    154    155    156    157    158    159    160 
## 0.6247 0.6036 0.5363 0.4336 0.6748 0.4489 0.7010 0.4441 0.4986 0.4451 
##    161    162    163    164    165    166    167    168    169    170 
## 0.4102 0.6543 0.5224 0.6137 0.6317 0.5000 0.4308 0.4861 0.6780 0.4361 
##    171    172    173    174    175    176    177    178    179    180 
## 0.6620 0.6581 0.6992 0.4602 0.4137 0.6391 0.6228 0.6851 0.5977 0.6332 
##    181    182    183    184    185    186    187    188    189    190 
## 0.4698 0.4494 0.6532 0.7151 0.6177 0.4415 0.6977 0.6755 0.5844 0.6767 
##    191    192    193    194    195    196    197    198    199    200 
## 0.6125 0.4399 0.5062 0.6395 0.4052 0.6253 0.4909 0.6311 0.4578 0.6187 
## 
## $se.fit
##       1       2       3       4       5       6       7       8       9 
## 0.06090 0.07493 0.07213 0.07595 0.08505 0.05446 0.05024 0.08497 0.08736 
##      10      11      12      13      14      15      16      17      18 
## 0.08458 0.11678 0.07902 0.07268 0.10658 0.07666 0.05544 0.05338 0.08647 
##      19      20      21      22      23      24      25      26      27 
## 0.10526 0.06544 0.06385 0.06895 0.05878 0.07211 0.10434 0.05489 0.07976 
##      28      29      30      31      32      33      34      35      36 
## 0.09837 0.06283 0.09679 0.07274 0.09728 0.05479 0.06119 0.06930 0.06898 
##      37      38      39      40      41      42      43      44      45 
## 0.06506 0.08250 0.04903 0.06905 0.08040 0.05357 0.08155 0.07871 0.05722 
##      46      47      48      49      50      51      52      53      54 
## 0.07044 0.05457 0.08967 0.06453 0.09121 0.09065 0.08373 0.05496 0.07556 
##      55      56      57      58      59      60      61      62      63 
## 0.07928 0.07808 0.06067 0.07489 0.05318 0.12015 0.06053 0.05486 0.08607 
##      64      65      66      67      68      69      70      71      72 
## 0.08180 0.06173 0.08619 0.08555 0.07004 0.06089 0.07659 0.08207 0.09648 
##      73      74      75      76      77      78      79      80      81 
## 0.05605 0.06985 0.07491 0.07977 0.07469 0.06774 0.04755 0.07062 0.09192 
##      82      83      84      85      86      87      88      89      90 
## 0.06098 0.09897 0.07266 0.09254 0.07215 0.06856 0.09395 0.08548 0.07833 
##      91      92      93      94      95      96      97      98      99 
## 0.08700 0.08957 0.09002 0.06974 0.04870 0.05586 0.07756 0.07460 0.09843 
##     100     101     102     103     104     105     106     107     108 
## 0.09045 0.05616 0.08076 0.05338 0.05133 0.05233 0.11964 0.06895 0.09019 
##     109     110     111     112     113     114     115     116     117 
## 0.09251 0.05045 0.08785 0.07194 0.11253 0.07388 0.06328 0.06781 0.08778 
##     118     119     120     121     122     123     124     125     126 
## 0.05248 0.11138 0.05489 0.06888 0.05946 0.05557 0.12468 0.07418 0.11247 
##     127     128     129     130     131     132     133     134     135 
## 0.08133 0.08255 0.07904 0.09024 0.09377 0.08286 0.05628 0.07110 0.10181 
##     136     137     138     139     140     141     142     143     144 
## 0.07580 0.05096 0.08026 0.06026 0.09215 0.09064 0.08885 0.05537 0.05696 
##     145     146     147     148     149     150     151     152     153 
## 0.06283 0.07662 0.07562 0.07680 0.05194 0.06802 0.06453 0.09005 0.10586 
##     154     155     156     157     158     159     160     161     162 
## 0.06816 0.09976 0.05952 0.08116 0.09688 0.09872 0.06212 0.07796 0.07138 
##     163     164     165     166     167     168     169     170     171 
## 0.09198 0.06561 0.07026 0.08818 0.08680 0.06974 0.06071 0.07222 0.06292 
##     172     173     174     175     176     177     178     179     180 
## 0.04906 0.08438 0.08074 0.07343 0.04938 0.05798 0.07527 0.08185 0.05499 
##     181     182     183     184     185     186     187     188     189 
## 0.05549 0.07398 0.05759 0.09607 0.06677 0.06207 0.07768 0.07232 0.09640 
##     190     191     192     193     194     195     196     197     198 
## 0.09458 0.07770 0.07854 0.10283 0.05069 0.08024 0.05617 0.06491 0.05762 
##     199     200 
## 0.05557 0.07847 
## 
## $residual.scale
## [1] 1

We can even draw a small plot showing the predicted values separately for levels of x1 (recall that x1 is a binary/indicator variable):

plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
points(x2[x1 == 0], p3b.fitted$fit[x1 == 0], col = rgb(1, 0, 0, 0.5))
points(x2[x1 == 1], p3b.fitted$fit[x1 == 1], col = rgb(0, 0, 1, 0.5))

[Plot: fitted probabilities of y==1 against x2, red for x1==0 and blue for x1==1]

But this graph doesn't show the fit of the model to all values of x1 and x2 (or x3) and doesn't communicate any of our uncertainty.

Predicted Probability Plots

To get a better grasp on our models, we'll create some fake data representing the full scales of x1, x2, and x3:

newdata1 <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10))
newdata2 <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10), x3 = seq(0, 
    5, length.out = 25))

We can then use these new fake data to generate predicted probabilities of each outcome at each combination of covariates:

p1 <- predict(m1, newdata1, type = "response", se.fit = TRUE)
p2a <- predict(m2a, newdata1, type = "response", se.fit = TRUE)
p2b <- predict(m2b, newdata1, type = "response", se.fit = TRUE)
p3a <- predict(m3a, newdata2, type = "response", se.fit = TRUE)
p3b <- predict(m3b, newdata2, type = "response", se.fit = TRUE)

We can look at one of these objects, e.g. p3b, to see that we have predicted probabilities and associated standard errors:

p3b
## $fit
##      1      2      3      4      5      6      7      8      9     10 
## 0.4368 0.6320 0.4509 0.6421 0.4650 0.6521 0.4792 0.6619 0.4935 0.6717 
##     11     12     13     14     15     16     17     18     19     20 
## 0.5077 0.6814 0.5219 0.6909 0.5361 0.7003 0.5503 0.7096 0.5644 0.7187 
##     21     22     23     24     25     26     27     28     29     30 
## 0.4342 0.6295 0.4483 0.6396 0.4624 0.6496 0.4766 0.6595 0.4909 0.6693 
##     31     32     33     34     35     36     37     38     39     40 
## 0.5051 0.6790 0.5193 0.6886 0.5335 0.6980 0.5477 0.7073 0.5618 0.7165 
##     41     42     43     44     45     46     47     48     49     50 
## 0.4316 0.6271 0.4457 0.6372 0.4598 0.6472 0.4740 0.6571 0.4882 0.6670 
##     51     52     53     54     55     56     57     58     59     60 
## 0.5025 0.6767 0.5167 0.6863 0.5309 0.6957 0.5451 0.7051 0.5592 0.7143 
##     61     62     63     64     65     66     67     68     69     70 
## 0.4291 0.6246 0.4431 0.6347 0.4572 0.6448 0.4714 0.6547 0.4856 0.6646 
##     71     72     73     74     75     76     77     78     79     80 
## 0.4999 0.6743 0.5141 0.6839 0.5283 0.6934 0.5425 0.7028 0.5566 0.7120 
##     81     82     83     84     85     86     87     88     89     90 
## 0.4265 0.6221 0.4405 0.6323 0.4546 0.6423 0.4688 0.6523 0.4830 0.6622 
##     91     92     93     94     95     96     97     98     99    100 
## 0.4972 0.6719 0.5115 0.6816 0.5257 0.6911 0.5399 0.7005 0.5540 0.7098 
##    101    102    103    104    105    106    107    108    109    110 
## 0.4239 0.6196 0.4379 0.6298 0.4520 0.6399 0.4662 0.6499 0.4804 0.6598 
##    111    112    113    114    115    116    117    118    119    120 
## 0.4946 0.6696 0.5089 0.6792 0.5231 0.6888 0.5373 0.6982 0.5514 0.7075 
##    121    122    123    124    125    126    127    128    129    130 
## 0.4213 0.6171 0.4354 0.6273 0.4494 0.6374 0.4636 0.6475 0.4778 0.6574 
##    131    132    133    134    135    136    137    138    139    140 
## 0.4920 0.6672 0.5062 0.6769 0.5205 0.6865 0.5347 0.6959 0.5488 0.7053 
##    141    142    143    144    145    146    147    148    149    150 
## 0.4188 0.6146 0.4328 0.6248 0.4468 0.6350 0.4610 0.6450 0.4752 0.6550 
##    151    152    153    154    155    156    157    158    159    160 
## 0.4894 0.6648 0.5036 0.6745 0.5179 0.6842 0.5321 0.6936 0.5462 0.7030 
##    161    162    163    164    165    166    167    168    169    170 
## 0.4162 0.6121 0.4302 0.6223 0.4443 0.6325 0.4584 0.6426 0.4726 0.6525 
##    171    172    173    174    175    176    177    178    179    180 
## 0.4868 0.6624 0.5010 0.6722 0.5153 0.6818 0.5295 0.6913 0.5436 0.7007 
##    181    182    183    184    185    186    187    188    189    190 
## 0.4137 0.6096 0.4276 0.6198 0.4417 0.6300 0.4558 0.6401 0.4700 0.6501 
##    191    192    193    194    195    196    197    198    199    200 
## 0.4842 0.6600 0.4984 0.6698 0.5126 0.6795 0.5269 0.6890 0.5410 0.6985 
##    201    202    203    204    205    206    207    208    209    210 
## 0.4111 0.6071 0.4251 0.6173 0.4391 0.6275 0.4532 0.6377 0.4674 0.6477 
##    211    212    213    214    215    216    217    218    219    220 
## 0.4816 0.6576 0.4958 0.6674 0.5100 0.6771 0.5242 0.6867 0.5384 0.6962 
##    221    222    223    224    225    226    227    228    229    230 
## 0.4086 0.6045 0.4225 0.6148 0.4365 0.6251 0.4506 0.6352 0.4647 0.6453 
##    231    232    233    234    235    236    237    238    239    240 
## 0.4789 0.6552 0.4932 0.6650 0.5074 0.6748 0.5216 0.6844 0.5358 0.6939 
##    241    242    243    244    245    246    247    248    249    250 
## 0.4060 0.6020 0.4199 0.6123 0.4339 0.6226 0.4480 0.6327 0.4621 0.6428 
##    251    252    253    254    255    256    257    258    259    260 
## 0.4763 0.6528 0.4906 0.6627 0.5048 0.6724 0.5190 0.6821 0.5332 0.6916 
##    261    262    263    264    265    266    267    268    269    270 
## 0.4035 0.5995 0.4174 0.6098 0.4313 0.6201 0.4454 0.6303 0.4595 0.6404 
##    271    272    273    274    275    276    277    278    279    280 
## 0.4737 0.6504 0.4879 0.6603 0.5022 0.6700 0.5164 0.6797 0.5306 0.6893 
##    281    282    283    284    285    286    287    288    289    290 
## 0.4010 0.5969 0.4148 0.6073 0.4288 0.6176 0.4428 0.6278 0.4569 0.6379 
##    291    292    293    294    295    296    297    298    299    300 
## 0.4711 0.6479 0.4853 0.6579 0.4996 0.6677 0.5138 0.6774 0.5280 0.6869 
##    301    302    303    304    305    306    307    308    309    310 
## 0.3984 0.5944 0.4123 0.6048 0.4262 0.6151 0.4402 0.6253 0.4543 0.6354 
##    311    312    313    314    315    316    317    318    319    320 
## 0.4685 0.6455 0.4827 0.6554 0.4970 0.6653 0.5112 0.6750 0.5254 0.6846 
##    321    322    323    324    325    326    327    328    329    330 
## 0.3959 0.5919 0.4097 0.6023 0.4236 0.6126 0.4376 0.6228 0.4517 0.6330 
##    331    332    333    334    335    336    337    338    339    340 
## 0.4659 0.6431 0.4801 0.6530 0.4943 0.6629 0.5086 0.6726 0.5228 0.6823 
##    341    342    343    344    345    346    347    348    349    350 
## 0.3934 0.5893 0.4072 0.5997 0.4211 0.6101 0.4351 0.6203 0.4492 0.6305 
##    351    352    353    354    355    356    357    358    359    360 
## 0.4633 0.6406 0.4775 0.6506 0.4917 0.6605 0.5060 0.6703 0.5202 0.6799 
##    361    362    363    364    365    366    367    368    369    370 
## 0.3909 0.5868 0.4046 0.5972 0.4185 0.6075 0.4325 0.6178 0.4466 0.6280 
##    371    372    373    374    375    376    377    378    379    380 
## 0.4607 0.6382 0.4749 0.6482 0.4891 0.6581 0.5033 0.6679 0.5176 0.6776 
##    381    382    383    384    385    386    387    388    389    390 
## 0.3883 0.5842 0.4021 0.5946 0.4159 0.6050 0.4299 0.6153 0.4440 0.6256 
##    391    392    393    394    395    396    397    398    399    400 
## 0.4581 0.6357 0.4723 0.6457 0.4865 0.6557 0.5007 0.6655 0.5150 0.6752 
##    401    402    403    404    405    406    407    408    409    410 
## 0.3858 0.5816 0.3995 0.5921 0.4134 0.6025 0.4273 0.6128 0.4414 0.6231 
##    411    412    413    414    415    416    417    418    419    420 
## 0.4555 0.6332 0.4697 0.6433 0.4839 0.6533 0.4981 0.6631 0.5123 0.6729 
##    421    422    423    424    425    426    427    428    429    430 
## 0.3833 0.5791 0.3970 0.5896 0.4108 0.6000 0.4248 0.6103 0.4388 0.6206 
##    431    432    433    434    435    436    437    438    439    440 
## 0.4529 0.6308 0.4671 0.6408 0.4813 0.6508 0.4955 0.6607 0.5097 0.6705 
##    441    442    443    444    445    446    447    448    449    450 
## 0.3808 0.5765 0.3945 0.5870 0.4083 0.5974 0.4222 0.6078 0.4362 0.6181 
##    451    452    453    454    455    456    457    458    459    460 
## 0.4503 0.6283 0.4645 0.6384 0.4787 0.6484 0.4929 0.6583 0.5071 0.6681 
##    461    462    463    464    465    466    467    468    469    470 
## 0.3783 0.5739 0.3920 0.5845 0.4057 0.5949 0.4196 0.6053 0.4336 0.6156 
##    471    472    473    474    475    476    477    478    479    480 
## 0.4477 0.6258 0.4619 0.6359 0.4760 0.6460 0.4903 0.6559 0.5045 0.6657 
##    481    482    483    484    485    486    487    488    489    490 
## 0.3758 0.5714 0.3895 0.5819 0.4032 0.5924 0.4171 0.6027 0.4311 0.6131 
##    491    492    493    494    495    496    497    498    499    500 
## 0.4451 0.6233 0.4593 0.6335 0.4734 0.6435 0.4877 0.6535 0.5019 0.6634 
## 
## $se.fit
##       1       2       3       4       5       6       7       8       9 
## 0.10906 0.11971 0.09785 0.10469 0.08926 0.09169 0.08427 0.08147 0.08364 
##      10      11      12      13      14      15      16      17      18 
## 0.07488 0.08750 0.07263 0.09527 0.07477 0.10602 0.08065 0.11883 0.08923 
##      19      20      21      22      23      24      25      26      27 
## 0.13296 0.09953 0.10626 0.11718 0.09465 0.10190 0.08564 0.08862 0.08033 
##      28      29      30      31      32      33      34      35      36 
## 0.07816 0.07956 0.07148 0.08352 0.06934 0.09157 0.07183 0.10267 0.07818 
##      37      38      39      40      41      42      43      44      45 
## 0.11584 0.08725 0.13030 0.09800 0.10365 0.11478 0.09163 0.09923 0.08220 
##      46      47      48      49      50      51      52      53      54 
## 0.08568 0.07653 0.07498 0.07562 0.06818 0.07968 0.06617 0.08801 0.06903 
##      55      56      57      58      59      60      61      62      63 
## 0.09947 0.07586 0.11299 0.08542 0.12779 0.09661 0.10123 0.11251 0.08881 
##      64      65      66      67      68      69      70      71      72 
## 0.09671 0.07895 0.08288 0.07291 0.07193 0.07184 0.06502 0.07599 0.06315 
##      73      74      75      76      77      78      79      80      81 
## 0.08461 0.06638 0.09642 0.07372 0.11030 0.08376 0.12542 0.09538 0.09903 
##      82      83      84      85      86      87      88      89      90 
## 0.11040 0.08621 0.09435 0.07591 0.08025 0.06949 0.06905 0.06824 0.06203 
##      91      92      93      94      95      96      97      98      99 
## 0.07249 0.06030 0.08139 0.06393 0.09355 0.07176 0.10777 0.08229 0.12320 
##     100     101     102     103     104     105     106     107     108 
## 0.09432 0.09705 0.10845 0.08386 0.09217 0.07312 0.07780 0.06631 0.06636 
##     109     110     111     112     113     114     115     116     117 
## 0.06485 0.05922 0.06920 0.05766 0.07838 0.06170 0.09088 0.07003 0.10543 
##     118     119     120     121     122     123     124     125     126 
## 0.08102 0.12115 0.09344 0.09530 0.10667 0.08176 0.09017 0.07060 0.07556 
##     127     128     129     130     131     132     133     134     135 
## 0.06339 0.06388 0.06173 0.05665 0.06614 0.05525 0.07560 0.05971 0.08843 
##     136     137     138     139     140     141     142     143     144 
## 0.06853 0.10328 0.07997 0.11927 0.09276 0.09381 0.10508 0.07994 0.08839 
##     145     146     147     148     149     150     151     152     153 
## 0.06838 0.07355 0.06077 0.06166 0.05889 0.05435 0.06337 0.05313 0.07307 
##     154     155     156     157     158     159     160     161     162 
## 0.05801 0.08621 0.06730 0.10134 0.07914 0.11758 0.09227 0.09257 0.10369 
##     163     164     165     166     167     168     169     170     171 
## 0.07842 0.08683 0.06650 0.07179 0.05850 0.05973 0.05639 0.05236 0.06091 
##     172     173     174     175     176     177     178     179     180 
## 0.05134 0.07084 0.05662 0.08424 0.06635 0.09963 0.07856 0.11608 0.09198 
##     181     182     183     184     185     186     187     188     189 
## 0.09160 0.10251 0.07722 0.08551 0.06497 0.07032 0.05662 0.05811 0.05427 
##     190     191     192     193     194     195     196     197     198 
## 0.05072 0.05881 0.04992 0.06892 0.05558 0.08255 0.06570 0.09815 0.07823 
##     199     200     201     202     203     204     205     206     207 
## 0.11479 0.09191 0.09091 0.10154 0.07633 0.08445 0.06382 0.06915 0.05515 
##     208     209     210     211     212     213     214     215     216 
## 0.05686 0.05259 0.04949 0.05711 0.04891 0.06735 0.05491 0.08115 0.06536 
##     217     218     219     220     221     222     223     224     225 
## 0.09691 0.07817 0.11370 0.09206 0.09049 0.10081 0.07578 0.08366 0.06306 
##     226     227     228     229     230     231     232     233     234 
## 0.06831 0.05415 0.05599 0.05137 0.04869 0.05583 0.04833 0.06615 0.05464 
##     235     236     237     238     239     240     241     242     243 
## 0.08006 0.06536 0.09594 0.07837 0.11284 0.09243 0.09035 0.10031 0.07557 
##     244     245     246     247     248     249     250     251     252 
## 0.08315 0.06272 0.06780 0.05362 0.05553 0.05065 0.04836 0.05502 0.04823 
##     253     254     255     256     257     258     259     260     261 
## 0.06533 0.05477 0.07929 0.06568 0.09524 0.07885 0.11220 0.09303 0.09049 
##     262     263     264     265     266     267     268     269     270 
## 0.10005 0.07570 0.08293 0.06280 0.06765 0.05358 0.05549 0.05045 0.04852 
##     271     272     273     274     275     276     277     278     279 
## 0.05468 0.04860 0.06492 0.05532 0.07886 0.06634 0.09480 0.07959 0.11180 
##     280     281     282     283     284     285     286     287     288 
## 0.09384 0.09091 0.10004 0.07617 0.08301 0.06328 0.06785 0.05403 0.05589 
##     289     290     291     292     293     294     295     296     297 
## 0.05078 0.04916 0.05483 0.04944 0.06492 0.05627 0.07876 0.06733 0.09465 
##     298     299     300     301     302     303     304     305     306 
## 0.08061 0.11162 0.09488 0.09159 0.10028 0.07695 0.08338 0.06417 0.06842 
##     307     308     309     310     311     312     313     314     315 
## 0.05496 0.05671 0.05163 0.05027 0.05547 0.05074 0.06533 0.05761 0.07900 
##     316     317     318     319     320     321     322     323     324 
## 0.06864 0.09478 0.08188 0.11168 0.09614 0.09252 0.10077 0.07806 0.08406 
##     325     326     327     328     329     330     331     332     333 
## 0.06543 0.06934 0.05633 0.05796 0.05295 0.05183 0.05657 0.05247 0.06615 
##     334     335     336     337     338     339     340     341     342 
## 0.05932 0.07957 0.07026 0.09519 0.08342 0.11198 0.09762 0.09371 0.10151 
##     343     344     345     346     347     348     349     350     351 
## 0.07945 0.08502 0.06705 0.07061 0.05812 0.05959 0.05473 0.05380 0.05811 
##     352     353     354     355     356     357     358     359     360 
## 0.05459 0.06735 0.06138 0.08048 0.07218 0.09587 0.08519 0.11251 0.09930 
##     361     362     363     364     365     366     367     368     369 
## 0.09513 0.10250 0.08113 0.08628 0.06900 0.07221 0.06028 0.06160 0.05691 
##     370     371     372     373     374     375     376     377     378 
## 0.05616 0.06004 0.05707 0.06891 0.06375 0.08170 0.07436 0.09682 0.08721 
##     379     380     381     382     383     384     385     386     387 
## 0.11328 0.10118 0.09678 0.10372 0.08306 0.08781 0.07124 0.07412 0.06277 
##     388     389     390     391     392     393     394     395     396 
## 0.06394 0.05945 0.05885 0.06234 0.05987 0.07082 0.06642 0.08322 0.07681 
##     397     398     399     400     401     402     403     404     405 
## 0.09804 0.08945 0.11426 0.10326 0.09863 0.10518 0.08524 0.08960 0.07374 
##     406     407     408     409     410     411     412     413     414 
## 0.07633 0.06555 0.06659 0.06230 0.06184 0.06496 0.06294 0.07304 0.06934 
##     415     416     417     418     419     420     421     422     423 
## 0.08503 0.07949 0.09951 0.09189 0.11548 0.10552 0.10067 0.10687 0.08762 
##     424     425     426     427     428     429     430     431     432 
## 0.09164 0.07649 0.07880 0.06858 0.06951 0.06541 0.06508 0.06787 0.06626 
##     433     434     435     436     437     438     439     440     441 
## 0.07554 0.07249 0.08710 0.08238 0.10122 0.09454 0.11690 0.10797 0.10289 
##     442     443     444     445     446     447     448     449     450 
## 0.10877 0.09021 0.09393 0.07944 0.08152 0.07183 0.07267 0.06875 0.06856 
##     451     452     453     454     455     456     457     458     459 
## 0.07101 0.06979 0.07829 0.07585 0.08942 0.08548 0.10316 0.09737 0.11853 
##     460     461     462     463     464     465     466     467     468 
## 0.11058 0.10528 0.11087 0.09297 0.09643 0.08258 0.08447 0.07527 0.07605 
##     469     470     471     472     473     474     475     476     477 
## 0.07228 0.07223 0.07437 0.07351 0.08127 0.07940 0.09197 0.08875 0.10531 
##     478     479     480     481     482     483     484     485     486 
## 0.10038 0.12036 0.11335 0.10782 0.11318 0.09588 0.09914 0.08587 0.08763 
##     487     488     489     490     491     492     493     494     495 
## 0.07886 0.07963 0.07598 0.07608 0.07791 0.07739 0.08445 0.08311 0.09473 
##     496     497     498     499     500 
## 0.09219 0.10767 0.10354 0.12238 0.11627 
## 
## $residual.scale
## [1] 1

It is then relatively straightforward to plot the predicted probabilities for all of our data. We'll start with the simple, no-interaction model, then look at the models with the interaction and the additional covariate x3.

Simple, no-interaction model

plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0] + 1.96 * p1$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p1$fit[newdata1$x1 == 0] - 1.96 * p1$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1] + 1.96 * p1$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p1$fit[newdata1$x1 == 1] - 1.96 * p1$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
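
One caveat about these bands (an aside going beyond the original text): adding and subtracting 1.96 standard errors on the probability scale can produce limits outside [0, 1]. An alternative is to construct the band on the link (here, probit) scale and then transform it with pnorm; a sketch for m1, with illustrative object names:

# predictions and standard errors on the linear predictor (link) scale
p1link <- predict(m1, newdata1, type = "link", se.fit = TRUE)
# transform the interval endpoints back to the probability scale
upper <- pnorm(p1link$fit + 1.96 * p1link$se.fit)
lower <- pnorm(p1link$fit - 1.96 * p1link$se.fit)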

[Plot: predicted probability curves with 95% confidence bands from the no-interaction model, x1==0 in red and x1==1 in blue]

The above plot shows two predicted probability curves with heavily overlapping confidence bands. While the effect of x2 is clearly different from zero for both x1==0 and x1==1, the difference between the two curves is not significant. But this model is based on data with no underlying interaction. Let's look next at the outcome that is a function of an interaction between covariates.

Interaction model

Recall that the interaction model (with outcome y2s) was estimated in two different ways. The first estimated model did not account for the interaction, while the second estimated model did account for the interaction. Let's see the two models side-by-side to compare the inference we would draw about the interaction:

# Model estimated without interaction
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1", 
    main = "Estimated without interaction")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0] + 1.96 * p2a$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2a$fit[newdata1$x1 == 0] - 1.96 * p2a$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1] + 1.96 * p2a$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2a$fit[newdata1$x1 == 1] - 1.96 * p2a$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
# Model estimated with interaction
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1", 
    main = "Estimated with interaction")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] + 1.96 * p2b$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] - 1.96 * p2b$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] + 1.96 * p2b$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] - 1.96 * p2b$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)

[Plot: predicted probability curves for the interaction outcome, estimated without (left) and with (right) the interaction term]

The left-hand model leads us to some incorrect inference. Both predicted probability curves are essentially identical, suggesting that the influence of x2 is constant across both levels of x1. This is because our model did not account for any interaction. The right-hand model leads us to substantially different inference. When x1==0 (shown in red), there appears to be almost no effect of x2, but when x1==1, the effect of x2 is strongly positive.

Model with additional covariate

When we add an additional covariate to the model, things become much more complicated. Recall that the predicted probabilities have to be calculated at some value of each covariate. In other words, we have to define the predicted probability in terms of all of the covariates in the model. Thus, when we add an additional covariate (even if it does not interact with our focal covariates x1 and x2), we need to account for it when estimating our predicted probabilities. We'll see this at work when we plot the predicted probabilities for our (incorrect) model estimated without the x1*x2 interaction and for our (correct) model estimated with that interaction.

No-interaction model with an additional covariate

plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
s <- sapply(unique(newdata2$x3), function(i) {
    # `x1==0`
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] + 1.96 * p3a$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] - 1.96 * p3a$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    # `x1==1`
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] + 1.96 * p3a$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3a$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] - 1.96 * p3a$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
})

[Plot: predicted probability curves at each value of x3 from the model estimated without the interaction]

Note how the above code is much more complicated than before, because we now need to draw a separate predicted probability curve (with its associated confidence interval) at each level of x3, even though we're not particularly interested in x3. The result is a very confusing plot: the predicted probability curves at each level of x3 are essentially the same, but the confidence intervals vary widely because of differing levels of certainty driven by the sparsity of the original data.

One common response is to simply draw the curve conditional on all other covariates (in this case x3) being at their means, but this is an arbitrary choice. We could also select the minimum, the maximum, or any other value. Let's write a small function to redraw our curves at different values of x3 to see the impact of this choice:

ppcurve <- function(value_of_x3, title) {
    tmp <- expand.grid(x1 = 0:1, x2 = seq(0, 1, length.out = 10), x3 = value_of_x3)
    p3tmp <- predict(m3a, tmp, type = "response", se.fit = TRUE)
    plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1", 
        main = title)
    # `x1==0`
    lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0], col = "red")
    lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0] + 1.96 * p3tmp$se.fit[tmp$x1 == 
        0], col = "red", lty = 2)
    lines(tmp$x2[tmp$x1 == 0], p3tmp$fit[tmp$x1 == 0] - 1.96 * p3tmp$se.fit[tmp$x1 == 
        0], col = "red", lty = 2)
    # `x1==1`
    lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1], col = "blue")
    lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1] + 1.96 * p3tmp$se.fit[tmp$x1 == 
        1], col = "blue", lty = 2)
    lines(tmp$x2[tmp$x1 == 1], p3tmp$fit[tmp$x1 == 1] - 1.96 * p3tmp$se.fit[tmp$x1 == 
        1], col = "blue", lty = 2)
}

We can then draw a plot that shows the curves for the mean of x3, the minimum of x3 and the maximum of x3.

layout(matrix(1:3, nrow = 1))
ppcurve(mean(x3), title = "x3 at mean")
ppcurve(min(x3), title = "x3 at min")
ppcurve(max(x3), title = "x3 at max")


The above set of plots shows that while the inference about the predicted probability curves is the same, the choice of what value of x3 to condition on is meaningful for the confidence intervals. The confidence intervals are much narrower when we condition on the mean value of x3 than on the minimum or maximum.

Recall that this model did not properly account for the x1*x2 interaction. Thus while our inference is somewhat sensitive to the choice of conditioning value of the x3 covariate, it is unclear if this minimal sensitivity holds when we properly account for the interaction. Let's take a look at our m3b model that accounts for the interaction.

Interaction model with additional covariate

Let's start by drawing a plot showing the predicted values of the outcome for every combination of x1, x2, and x3:

plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1")
s <- sapply(unique(newdata2$x3), function(i) {
    # `x1==0`
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    # `x1==1`
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
})


This plot is incredibly messy. Now, not only are the confidence bands sensitive to what value of x3 we condition on, so too are the predicted probability curves themselves. What level of the additional covariates to condition on when estimating the predicted probabilities is therefore a fairly important decision.

Marginal Effects Plots

A different approach when dealing with interactions is to show marginal effects. Marginal effects, I think, are a bit abstract (i.e., a bit removed from the actual data because they attempt to summarize a lot of information in a single number). The marginal effect is the slope of the curve drawn by taking the difference between, e.g., the predicted probability that y==1 when x1==1 and the predicted probability that y==1 when x1==0, at each level of x2. Thus, the marginal effect is simply the slope of the difference between the two curves that we were drawing in the above graphs (i.e., the slope of the change in predicted probabilities). Of course, as we just saw, if any additional covariate(s) are involved in the data-generating process, then the marginal effect - like the predicted probabilities - is going to differ across levels of that covariate.

Simple interaction model without additional covariates

Let's see how this works by first returning to our simple interaction model (without x3) and then look at the interaction model with the additional covariate.

To plot the change in predicted probabilities due to x1 across the values of x2, we simply need to take our predicted probabilities from above and difference the values predicted for x1==0 and x1==1. The predicted probabilities for our simple interaction model are stored in p2b, based on new data from newdata1. Let's separate out the values predicted for x1==0 and x1==1 and then take their difference. First, we'll create a new dataframe that binds newdata1 and the predicted probability and standard error values from p2b together. Then we'll use the split function to divide that dataframe based upon the value of x1.

tmpdf <- newdata1
tmpdf$fit <- p2b$fit
tmpdf$se.fit <- p2b$se.fit
tmpsplit <- split(tmpdf, tmpdf$x1)

The result is a list of two dataframes, each containing values of x1, x2, and the associated predicted probabilities:

tmpsplit
## $`0`
##    x1     x2    fit  se.fit
## 1   0 0.0000 0.5014 0.09235
## 3   0 0.1111 0.5011 0.07665
## 5   0 0.2222 0.5007 0.06320
## 7   0 0.3333 0.5003 0.05373
## 9   0 0.4444 0.5000 0.05053
## 11  0 0.5556 0.4996 0.05470
## 13  0 0.6667 0.4992 0.06484
## 15  0 0.7778 0.4989 0.07867
## 17  0 0.8889 0.4985 0.09459
## 19  0 1.0000 0.4982 0.11171
## 
## $`1`
##    x1     x2    fit  se.fit
## 2   1 0.0000 0.3494 0.09498
## 4   1 0.1111 0.3839 0.08187
## 6   1 0.2222 0.4194 0.06887
## 8   1 0.3333 0.4556 0.05769
## 10  1 0.4444 0.4921 0.05089
## 12  1 0.5556 0.5287 0.05093
## 14  1 0.6667 0.5650 0.05770
## 16  1 0.7778 0.6009 0.06862
## 18  1 0.8889 0.6358 0.08115
## 20  1 1.0000 0.6697 0.09362

To calculate the change in predicted probability of y==1 due to x1==1 at each value of x2, we'll simply difference the fit variable from each dataframe:

me <- tmpsplit[[2]]$fit - tmpsplit[[1]]$fit
me
##  [1] -0.152032 -0.117131 -0.081283 -0.044766 -0.007877  0.029079  0.065793
##  [8]  0.101966  0.137309  0.171555

We also want the standard error of that difference:

me_se <- sqrt(0.5 * (tmpsplit[[2]]$se.fit + tmpsplit[[1]]$se.fit))

Now let's plot the original predicted probability plot on the left and the change in predicted probability plot on the right:

layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1", 
    main = "Predicted Probabilities")
# `x1==0`
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0], col = "red")
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] + 1.96 * p2b$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
lines(newdata1$x2[newdata1$x1 == 0], p2b$fit[newdata1$x1 == 0] - 1.96 * p2b$se.fit[newdata1$x1 == 
    0], col = "red", lty = 2)
# `x1==1`
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1], col = "blue")
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] + 1.96 * p2b$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
lines(newdata1$x2[newdata1$x1 == 1], p2b$fit[newdata1$x1 == 1] - 1.96 * p2b$se.fit[newdata1$x1 == 
    1], col = "blue", lty = 2)
# plot of change in predicted probabilities:
plot(NA, type = "l", xlim = c(0, 1), ylim = c(-1, 1), xlab = "x2", ylab = "Change in Predicted Probability of y=1", 
    main = "Change in Predicted Probability due to x1")
abline(h = 0, col = "gray")  # gray line at zero
lines(tmpsplit[[1]]$x2, me, lwd = 2)  # change in predicted probabilities
lines(tmpsplit[[1]]$x2, me - 1.96 * me_se, lty = 2)
lines(tmpsplit[[1]]$x2, me + 1.96 * me_se, lty = 2)


As should be clear, the plot on the right is simply a further information reduction of the lefthand plot. Where the separate predicted probabilities show the predicted probability of the outcome at each combination of x1 and x2, the righthand plot simply shows the difference between these two curves.

The marginal effect of x1 is thus a further information reduction: it is the slope of the line showing the difference in predicted probabilities. Because our x2 variable is scaled [0,1], we can see the marginal effect simply by subtracting the value of the change in predicted probabilities when x2==0 from the value of the change in predicted probabilities when x2==1, which is simply:

me[length(me)] - me[1]
## [1] 0.3236

Thus the marginal effect of x1 on the outcome is the slope of the line representing the change in predicted probabilities between x1==1 and x1==0 across the range of x2. I don't find that a particularly intuitive measure of effect and would instead prefer to draw some kind of plot rather than reduce that plot to a single number.

Interaction model with additional covariate

Things get more complicated, as we might expect, when we have to account for the additional covariate x3, which influenced our predicted probabilities above. Our predicted probabilities for these data are stored in p3b (based on input data in newdata2). We'll follow the same procedure just used to add those predicted probabilities into a dataframe with the variables from newdata2, then we'll split it based on x1:

tmpdf <- newdata2
tmpdf$fit <- p3b$fit
tmpdf$se.fit <- p3b$se.fit
tmpsplit <- split(tmpdf, tmpdf$x1)

The result is a list of two large dataframes:

str(tmpsplit)
## List of 2
##  $ 0:'data.frame':   250 obs. of  5 variables:
##   ..$ x1    : int [1:250] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ x2    : num [1:250] 0 0.111 0.222 0.333 0.444 ...
##   ..$ x3    : num [1:250] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ fit   : num [1:250] 0.437 0.451 0.465 0.479 0.493 ...
##   ..$ se.fit: num [1:250] 0.1091 0.0979 0.0893 0.0843 0.0836 ...
##   ..- attr(*, "out.attrs")=List of 2
##   .. ..$ dim     : Named int [1:3] 2 10 25
##   .. .. ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
##   .. ..$ dimnames:List of 3
##   .. .. ..$ x1: chr [1:2] "x1=0" "x1=1"
##   .. .. ..$ x2: chr [1:10] "x2=0.0000" "x2=0.1111" "x2=0.2222" "x2=0.3333" ...
##   .. .. ..$ x3: chr [1:25] "x3=0.0000" "x3=0.2083" "x3=0.4167" "x3=0.6250" ...
##  $ 1:'data.frame':   250 obs. of  5 variables:
##   ..$ x1    : int [1:250] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ x2    : num [1:250] 0 0.111 0.222 0.333 0.444 ...
##   ..$ x3    : num [1:250] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ fit   : num [1:250] 0.632 0.642 0.652 0.662 0.672 ...
##   ..$ se.fit: num [1:250] 0.1197 0.1047 0.0917 0.0815 0.0749 ...
##   ..- attr(*, "out.attrs")=List of 2
##   .. ..$ dim     : Named int [1:3] 2 10 25
##   .. .. ..- attr(*, "names")= chr [1:3] "x1" "x2" "x3"
##   .. ..$ dimnames:List of 3
##   .. .. ..$ x1: chr [1:2] "x1=0" "x1=1"
##   .. .. ..$ x2: chr [1:10] "x2=0.0000" "x2=0.1111" "x2=0.2222" "x2=0.3333" ...
##   .. .. ..$ x3: chr [1:25] "x3=0.0000" "x3=0.2083" "x3=0.4167" "x3=0.6250" ...

Now, we need to calculate the change in predicted probability within each of those dataframes, at each value of x3. That is tedious. So let's instead split by both x1 and x3:

tmpsplit <- split(tmpdf, list(tmpdf$x3, tmpdf$x1))

The result is a list of 50 dataframes, the first 25 of which contain data for x1==0 and the latter 25 of which contain data for x1==1:

length(tmpsplit)
## [1] 50
names(tmpsplit)
##  [1] "0.0"                 "0.208333333333333.0" "0.416666666666667.0"
##  [4] "0.625.0"             "0.833333333333333.0" "1.04166666666667.0" 
##  [7] "1.25.0"              "1.45833333333333.0"  "1.66666666666667.0" 
## [10] "1.875.0"             "2.08333333333333.0"  "2.29166666666667.0" 
## [13] "2.5.0"               "2.70833333333333.0"  "2.91666666666667.0" 
## [16] "3.125.0"             "3.33333333333333.0"  "3.54166666666667.0" 
## [19] "3.75.0"              "3.95833333333333.0"  "4.16666666666667.0" 
## [22] "4.375.0"             "4.58333333333333.0"  "4.79166666666667.0" 
## [25] "5.0"                 "0.1"                 "0.208333333333333.1"
## [28] "0.416666666666667.1" "0.625.1"             "0.833333333333333.1"
## [31] "1.04166666666667.1"  "1.25.1"              "1.45833333333333.1" 
## [34] "1.66666666666667.1"  "1.875.1"             "2.08333333333333.1" 
## [37] "2.29166666666667.1"  "2.5.1"               "2.70833333333333.1" 
## [40] "2.91666666666667.1"  "3.125.1"             "3.33333333333333.1" 
## [43] "3.54166666666667.1"  "3.75.1"              "3.95833333333333.1" 
## [46] "4.16666666666667.1"  "4.375.1"             "4.58333333333333.1" 
## [49] "4.79166666666667.1"  "5.1"

We can then calculate our change in predicted probabilities at each level of x1 and x3. We'll use the mapply function to do this quickly:

change <- mapply(function(a, b) b$fit - a$fit, tmpsplit[1:25], tmpsplit[26:50])

The resulting object change is a matrix, each column of which is the change in predicted probability at each level of x3. We can then use this matrix to plot each change in predicted probability on a single plot. Let's again draw this side-by-side with the predicted probability plot:

layout(matrix(1:2, nrow = 1))
# predicted probabilities
plot(NA, xlim = c(0, 1), ylim = c(0, 1), xlab = "x2", ylab = "Predicted Probability of y=1", 
    main = "Predicted Probabilities")
s <- sapply(unique(newdata2$x3), function(i) {
    # `x1==0`
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i], col = rgb(1, 0, 0, 0.5))
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 0 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        0 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 0 & newdata2$x3 == 
        i], col = rgb(1, 0, 0, 0.5), lty = 2)
    # `x1==1`
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i], col = rgb(0, 0, 1, 0.5))
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] + 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
    lines(newdata2$x2[newdata2$x1 == 1 & newdata2$x3 == i], p3b$fit[newdata2$x1 == 
        1 & newdata2$x3 == i] - 1.96 * p3b$se.fit[newdata2$x1 == 1 & newdata2$x3 == 
        i], col = rgb(0, 0, 1, 0.5), lty = 2)
})
# change in predicted probabilities
plot(NA, type = "l", xlim = c(0, 1), ylim = c(-1, 1), xlab = "x2", ylab = "Change in Predicted Probability of y=1", 
    main = "Change in Predicted Probability due to x1")
abline(h = 0, col = "gray")
apply(change, 2, function(a) lines(tmpsplit[[1]]$x2, a))


## NULL

As we can see, despite the craziness of the left-hand plot, the marginal effect of x1 is actually not affected by x3 (which makes sense because it is not interacted with x3 in the data-generating process). Thus, while the choice of value of x3 on which to estimate the predicted probabilities matters, the marginal effect is constant. We can estimate it simply by following the same procedure above for any column of our change matrix:

change[nrow(change), 1] - change[1, 1]
##     0.0 
## -0.0409

The result here is a negligible marginal effect, which is what we would expect given the lack of an interaction between x1 and x3 in the underlying data. If such an interaction were in the actual data, then we should expect that this marginal effect would vary across values of x3 and we would need to further state the marginal effect as conditional on a particular value of x3.
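
Because x1 and x3 are not interacted in these data, we can verify that constancy directly. A quick check (using the change matrix built above) computes the implied marginal effect from every column; the values should be essentially identical across levels of x3:

summary(change[nrow(change), ] - change[1, ])  # one implied marginal effect per level of x3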

Binary Outcome GLM Plots

Unlike with linear models, interpreting GLMs requires looking at predicted values and this is often easiest to understand in the form of a plot. Let's start by creating some binary outcome data in a simple bivariate model:

set.seed(1)
n <- 100
x <- runif(n, 0, 1)
y <- rbinom(n, 1, x)

If we look at these data, we see that there is a relationship between x and y, where we are more likely to observe y==1 at higher values of x. We can fit a linear model to these data, but that fit is probably inappropriate, as we can see in the linear fit shown here:

plot(y ~ x, col = NULL, bg = rgb(0, 0, 0, 0.5), pch = 21)
abline(lm(y ~ x), lwd = 2)


We can use the predict function to obtain predicted probabilities from other model fits to see if they better fit the data.

Predicted probabilities for the logit model

We can start by fitting a logit model to the data:

m1 <- glm(y ~ x, family = binomial(link = "logit"))

As with OLS, we then construct the input data over which we want to predict the outcome:

newdf <- data.frame(x = seq(0, 1, length.out = 100))

Because a GLM relies on a link function, predict allows us to extract both the linear predictions and the predicted probabilities obtained through the inverse link. The default value for the type argument (type='link') gives predictions on the scale of the linear predictor. For logit models, these are directly interpretable as log-odds, but we'll come back to that in a minute. When we set type='response', we obtain predicted probabilities:

newdf$pout_logit <- predict(m1, newdf, se.fit = TRUE, type = "response")$fit

We also need to store the standard errors of the predicted probabilities and we can use those to build confidence intervals:

newdf$pse_logit <- predict(m1, newdf, se.fit = TRUE, type = "response")$se.fit
newdf$pupper_logit <- newdf$pout_logit + (1.96 * newdf$pse_logit)  # 95% CI upper bound
newdf$plower_logit <- newdf$pout_logit - (1.96 * newdf$pse_logit)  # 95% CI lower bound

With these data in hand, it is trivial to plot the predicted probability of y for each value of x:

with(newdf, plot(pout_logit ~ x, type = "l", lwd = 2))
with(newdf, lines(pupper_logit ~ x, type = "l", lty = 2))
with(newdf, lines(plower_logit ~ x, type = "l", lty = 2))


Predicted probabilities for the probit model

We can repeat the above procedure exactly in order to obtain predicted probabilities for the probit model. All we have to change is the value of link in our original call to glm:

m2 <- glm(y ~ x, family = binomial(link = "probit"))
newdf$pout_probit <- predict(m2, newdf, se.fit = TRUE, type = "response")$fit
newdf$pse_probit <- predict(m2, newdf, se.fit = TRUE, type = "response")$se.fit
newdf$pupper_probit <- newdf$pout_probit + (1.96 * newdf$pse_probit)
newdf$plower_probit <- newdf$pout_probit - (1.96 * newdf$pse_probit)

Here's the resulting plot, which looks very similar to the one from the logit model:

with(newdf, plot(pout_probit ~ x, type = "l", lwd = 2))
with(newdf, lines(pupper_probit ~ x, type = "l", lty = 2))
with(newdf, lines(plower_probit ~ x, type = "l", lty = 2))


Indeed, we can overlay the logit model (in red) and the probit model (in blue) and see that both models provide essentially identical inference. It's also helpful to have the original data underneath to see how the predicted probabilities communicate information about the original data:

# data
plot(y ~ x, col = NULL, bg = rgb(0, 0, 0, 0.5), pch = 21)
# logit
with(newdf, lines(pout_logit ~ x, type = "l", lwd = 2, col = "red"))
with(newdf, lines(pupper_logit ~ x, type = "l", lty = 2, col = "red"))
with(newdf, lines(plower_logit ~ x, type = "l", lty = 2, col = "red"))
# probit
with(newdf, lines(pout_probit ~ x, type = "l", lwd = 2, col = "blue"))
with(newdf, lines(pupper_probit ~ x, type = "l", lty = 2, col = "blue"))
with(newdf, lines(plower_probit ~ x, type = "l", lty = 2, col = "blue"))


Clearly, the model does an adequate job predicting y for high and low values of x, but offers a less accurate prediction for middling values. Note: You can see the influence of the logistic distribution's heavier tails in its higher predicted probabilities for y at low values of x (compared to the probit model) and the reverse at high values of x.

Plotting is therefore superior to looking at coefficients in order to compare models. This is especially apparent when we compare the substantively identical plots to the values of the coefficients from each model, which seem (at face value) quite different:

summary(m1)$coef[, 1:2]
##             Estimate Std. Error
## (Intercept)   -2.449     0.5780
## x              4.311     0.9769
summary(m2)$coef[, 1:2]
##             Estimate Std. Error
## (Intercept)   -1.496     0.3289
## x              2.622     0.5568
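
Although the coefficients differ considerably, the predictions they imply are nearly identical. As a small check (using m1 and m2 from above, and a throwaway data frame newx created here just for illustration), we can compare the two models' predicted probabilities at a few values of x:

newx <- data.frame(x = c(0, 0.5, 1))
cbind(newx,
      logit = predict(m1, newx, type = "response"),
      probit = predict(m2, newx, type = "response"))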

Log-odds predictions for logit models

As stated above, when dealing with logit models, we can also directly interpret the log-odds predictions from predict. Let's take a look at these using the default type='link' argument in predict:

logodds <- predict(m1, newdf, se.fit = TRUE)$fit

Whereas the predicted probabilities (from above) are strictly bounded [0,1]:

summary(newdf$pout_logit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0795  0.2020  0.4270  0.4460  0.6870  0.8660

the log-odds are allowed to vary over any value:

summary(logodds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.450  -1.370  -0.294  -0.294   0.784   1.860
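
The two scales are connected by the inverse link: applying plogis (the logistic distribution function, i.e., the inverse logit) to the linear predictions should recover the predicted probabilities calculated earlier. A quick sanity check:

max(abs(plogis(logodds) - newdf$pout_logit))  # should be essentially zero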

We can calculate standard errors, use those to build confidence intervals for the log-odds, and then plot to make a more direct interpretation:

logodds_se <- predict(m1, newdf, se.fit = TRUE)$se.fit
logodds_upper <- logodds + (1.96 * logodds_se)
logodds_lower <- logodds - (1.96 * logodds_se)
plot(logodds ~ newdf$x, type = "l", lwd = 2)
lines(logodds_upper ~ newdf$x, type = "l", lty = 2)
lines(logodds_lower ~ newdf$x, type = "l", lty = 2)


From this plot we can see that the log-odds of observing y==1 are positive when x>.5 and negative otherwise. But operating in log-odds is itself confusing because logs are fairly difficult to understand directly. Thus we can translate log-odds to odds by taking exp of the log-odds and redrawing the plot with the new data. Recall that the odds-ratio is the ratio of the betting odds (i.e., the probability of y==1 divided by the probability of y==0 at each value of x). The odds-ratio is strictly bounded below by 0. When the OR is 1, the odds are even (i.e., at that value of x, we are equally likely to see an observation with y==1 or y==0). We saw this in the earlier plot, where the log-odds changed from negative to positive at x==.5. An OR greater than 1 means that the odds of y==1 are higher than the odds of y==0. When less than 1, the opposite is true.
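
Before looking at the transformed plot, the arithmetic behind this translation may be easier to see with a few made-up log-odds values:

lo <- c(-2, 0, 2)            # illustrative log-odds values
odds <- exp(lo)              # odds of y==1 versus y==0
prob <- odds/(1 + odds)      # equivalent to plogis(lo)
cbind(logodds = lo, odds = odds, probability = prob)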

plot(exp(logodds) ~ newdf$x, type = "l", lwd = 2)
lines(exp(logodds_upper) ~ newdf$x, type = "l", lty = 2)
lines(exp(logodds_lower) ~ newdf$x, type = "l", lty = 2)


This plot shows that when x is low, the OR is between 0 and 1, but when x is high, the odds-ratio is quite large. At x==1, the OR is significantly larger than 1 and possibly higher than 6, suggesting that when x==1, the odds are 6 times higher for a unit to have y==1 than y==0.

Bivariate Regression

Regression on a binary covariate

The easiest way to understand bivariate regression is to view it as equivalent to a two-sample t-test. Imagine we have a binary variable (like male/female or treatment/control):

set.seed(1)
bin <- rbinom(1000, 1, 0.5)

Then we have an outcome that is influenced by that group:

out <- 2 * bin + rnorm(1000)

We can use by to calculate the treatment group means:

by(out, bin, mean)
## bin: 0
## [1] -0.01588
## -------------------------------------------------------- 
## bin: 1
## [1] 1.966

This translates to a difference of:

diff(by(out, bin, mean))
## [1] 1.982

A two-sample t-test shows us whether there is a significant difference between the two groups:

t.test(out ~ bin)
## 
##  Welch Two Sample t-test
## 
## data:  out by bin
## t = -30.3, df = 992.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.111 -1.854
## sample estimates:
## mean in group 0 mean in group 1 
##        -0.01588         1.96624

If we run a linear regression, we find that the mean-difference is the same as the regression slope:

lm(out ~ bin)
## 
## Call:
## lm(formula = out ~ bin)
## 
## Coefficients:
## (Intercept)          bin  
##     -0.0159       1.9821

And the t-statistic (and its significance) for the regression slope matches that from the t.test:

summary(lm(out ~ bin))$coef[2, ]
##   Estimate Std. Error    t value   Pr(>|t|) 
##  1.982e+00  6.544e-02  3.029e+01 1.949e-143

It becomes quite easy to see this visually in a plot of the regression:

plot(out ~ bin, col = "gray")
points(0:1, by(out, bin, mean), col = "blue", bg = "blue", pch = 23)
abline(coef(lm(out ~ bin)), col = "blue")


Regression on a continuous covariate

A regression involving a continuous covariate is similar, but rather than representing the difference in means between two groups (as when the covariate is binary), it represents the conditional mean of the outcome at each level of the covariate. We can see this in some simple fake data:

set.seed(1)
x <- runif(1000, 0, 10)
y <- 3 * x + rnorm(1000, 0, 5)

Here, we'll cut our covariate into five levels and estimate the density of the outcome y in each of those levels:

x1 <- ifelse(x < 2, 1, ifelse(x >= 2 & x < 4, 2, ifelse(x >= 4 & x < 6, 3, ifelse(x >= 
    6 & x < 8, 4, ifelse(x >= 8 & x < 10, 5, NA)))))
d1 <- density(y[x1 == 1])
d2 <- density(y[x1 == 2])
d3 <- density(y[x1 == 3])
d4 <- density(y[x1 == 4])
d5 <- density(y[x1 == 5])

We'll then use those values to show how the regression models the mean of y conditional on x. Let's start with the model:

m1 <- lm(y ~ x)

It is also worth highlighting that, in a bivariate regression model, the regression slope is simply a weighted version of the correlation coefficient. We can see this by calculating the correlation between x and y and then weighting it by the ratio of the standard deviations of y and x. You'll see that this is exactly the slope coefficient reported by R:

cor(y, x)
## [1] 0.8593
slope <- cor(y, x) * sqrt(cov(y, y)/cov(x, x))  # manually calculate coefficient as weighted correlation
coef(m1)[2]  # coefficient on x
##     x 
## 3.011
slope
## [1] 3.011

But let's plot the data to get a better understanding of what it looks like:

plot(x, y, col = "gray")
# add the regression equation:
abline(coef(m1), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + d1$y * 10, d1$x, type = "l", col = "black")
points(3 + d2$y * 10, d2$x, type = "l", col = "black")
points(5 + d3$y * 10, d3$x, type = "l", col = "black")
points(7 + d4$y * 10, d4$x, type = "l", col = "black")
points(9 + d5$y * 10, d5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y[x1 == 1]), col = "red", pch = 15)
points(3, mean(y[x1 == 2]), col = "red", pch = 15)
points(5, mean(y[x1 == 3]), col = "red", pch = 15)
points(7, mean(y[x1 == 4]), col = "red", pch = 15)
points(9, mean(y[x1 == 5]), col = "red", pch = 15)


As is clear, the regression line travels through the conditional means of y at each level of x. We can also see in the densities that y is approximately normally distributed at each value of x (because we made our data that way). These data thus nicely satisfy the assumptions for linear regression.

Obviously, our data rarely satisfy those assumptions so nicely. We can modify our fake data to have less desirable properties and see how that affects our inference. Let's put a discontinuity in our y value by simply increasing it by 10 for all values of x greater than 6:

y2 <- y
y2[x > 6] <- y[x > 6] + 10

We can build a new model for these data:

m2 <- lm(y2 ~ x)

Let's estimate the conditional densities, as we did above, but for the new data:

e1 <- density(y2[x1 == 1])
e2 <- density(y2[x1 == 2])
e3 <- density(y2[x1 == 3])
e4 <- density(y2[x1 == 4])
e5 <- density(y2[x1 == 5])

And then let's look at how that model fits the new data:

plot(x, y2, col = "gray")
# add the regression equation:
abline(coef(m2), col = "blue")
# add the conditional densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# add points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "red", pch = 15)
points(9, mean(y2[x1 == 5]), col = "red", pch = 15)


As should be clear in the plot, the line no longer goes through the conditional means (see, especially, the third density curve) because the outcome y2 is not a linear function of x. To obtain a better fit, we can estimate two separate lines, one on each side of the discontinuity:

m3a <- lm(y2[x <= 6] ~ x[x <= 6])
m3b <- lm(y2[x > 6] ~ x[x > 6])

Now let's redraw our data, with the fitted line for x<=6 in red and the fitted line for x>6 in blue:

plot(x, y2, col = "gray")
segments(0, coef(m3a)[1], 6, coef(m3a)[1] + 6 * coef(m3a)[2], col = "red")
segments(6, coef(m3b)[1] + (6 * coef(m3b)[2]), 10, coef(m3b)[1] + 10 * coef(m3b)[2], 
    col = "blue")
# redraw the densities:
abline(v = c(1, 3, 5, 7, 9), col = "gray", lty = 2)
points(1 + e1$y * 10, e1$x, type = "l", col = "black")
points(3 + e2$y * 10, e2$x, type = "l", col = "black")
points(5 + e3$y * 10, e3$x, type = "l", col = "black")
points(7 + e4$y * 10, e4$x, type = "l", col = "black")
points(9 + e5$y * 10, e5$x, type = "l", col = "black")
# redraw points representing conditional means:
points(1, mean(y2[x1 == 1]), col = "red", pch = 15)
points(3, mean(y2[x1 == 2]), col = "red", pch = 15)
points(5, mean(y2[x1 == 3]), col = "red", pch = 15)
points(7, mean(y2[x1 == 4]), col = "blue", pch = 15)
points(9, mean(y2[x1 == 5]), col = "blue", pch = 15)


Our two new models m3a and m3b are better fits to the data because they satisfy the requirement that the regression line travel through the conditional means of y. Thus, regardless of the form of our covariate(s), our regression models only provide valid inference if the regression line travels through the conditional mean of y for every value of x.
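
For completeness, the same piecewise structure can be fit in a single call by interacting x with an indicator for the discontinuity, which allows both the intercept and the slope to shift at x == 6 (a sketch not used elsewhere in this tutorial; m3c is just an illustrative name):

m3c <- lm(y2 ~ x * I(x > 6))
# the fully interacted model reproduces the separate fit on the left-hand segment:
all.equal(unname(fitted(m3c)[x <= 6]), unname(fitted(m3a)))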

Regression on a discrete covariate

Binary and continuous covariates are easy to model, but we often have data that are not binary or continuous, but instead are categorical. Building regression models with these kinds of variables gives us many options to consider. In particular, developing a model that passes through the conditional means of the outcome can be more complicated because the relationship between the outcome and a categorical covariate (if treated as continuous) is unlikely to be linear. We then have to decide how best to model the data. Let's start with some fake data to illustrate this:

a <- sample(1:5, 500, TRUE)
b <- numeric(length = 500)
b[a == 1] <- a[a == 1] + rnorm(sum(a == 1))
b[a == 2] <- 2 * a[a == 2] + rnorm(sum(a == 2))
b[a == 3] <- 2 * a[a == 3] + rnorm(sum(a == 3))
b[a == 4] <- 0.5 * a[a == 4] + rnorm(sum(a == 4))
b[a == 5] <- 2 * a[a == 5] + rnorm(sum(a == 5))

Let's treat a as a continuous covariate, assume the a-b relationship is linear, and build the corresponding linear regression model:

n1 <- lm(b ~ a)

We can see the relationship in the data by plotting b as a function of a:

plot(a, b, col = "gray")
abline(coef(n1), col = "blue")
# draw points representing conditional means:
points(1, mean(b[a == 1]), col = "red", pch = 15)
points(2, mean(b[a == 2]), col = "red", pch = 15)
points(3, mean(b[a == 3]), col = "red", pch = 15)
points(4, mean(b[a == 4]), col = "red", pch = 15)
points(5, mean(b[a == 5]), col = "red", pch = 15)


Clearly, the regression line misses the conditional mean values of b at all values of a. Our model is therefore not very good. To correct for this, we can either (1) attempt to transform our variables to force a straight-line relationship (which probably isn't possible in this case, but might be if the relationship were curvilinear) or (2) convert the a covariate to a factor and thus model the relationship as a series of indicator (or “dummy”) variables.

Discrete covariate as factor

When we treat a discrete covariate as a factor, R automatically transforms the variable into a series of indicator variables during the estimation of the regression. Let's compare our original model to this new model:

# our original model (treating `a` as continuous):
summary(n1)
## 
## Call:
## lm(formula = b ~ a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.392 -0.836  0.683  1.697  4.605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.4679     0.2569   -1.82    0.069 .  
## a             1.7095     0.0759   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.48 on 498 degrees of freedom
## Multiple R-squared:  0.505,  Adjusted R-squared:  0.504 
## F-statistic:  507 on 1 and 498 DF,  p-value: <2e-16
# our new model:
n2 <- lm(b ~ factor(a))
summary(n2)
## 
## Call:
## lm(formula = b ~ factor(a))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.063      0.104   10.21  < 2e-16 ***
## factor(a)2     2.772      0.154   18.03  < 2e-16 ***
## factor(a)3     4.881      0.150   32.47  < 2e-16 ***
## factor(a)4     0.848      0.152    5.56  4.4e-08 ***
## factor(a)5     8.946      0.143   62.35  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16

Obviously, the regression output is quite different for the two models. For n1, we see the slope of the line we drew in the plot above. For n2, we instead see the slopes comparing b for a==1 to b for all other levels of a (i.e., dummy coefficient slopes). R defaults to taking the lowest factor level as the baseline, but we can change this by reordering the levels of the factor:

# a==5 as baseline:
summary(lm(b ~ factor(a, levels = 5:1)))
## 
## Call:
## lm(formula = b ~ factor(a, levels = 5:1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               10.0087     0.0987   101.4   <2e-16 ***
## factor(a, levels = 5:1)4  -8.0978     0.1487   -54.5   <2e-16 ***
## factor(a, levels = 5:1)3  -4.0642     0.1466   -27.7   <2e-16 ***
## factor(a, levels = 5:1)2  -6.1733     0.1501   -41.1   <2e-16 ***
## factor(a, levels = 5:1)1  -8.9456     0.1435   -62.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16
# a==4 as baseline:
summary(lm(b ~ factor(a, levels = c(4, 1, 2, 3, 5))))
## 
## Call:
## lm(formula = b ~ factor(a, levels = c(4, 1, 2, 3, 5)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              1.911      0.111   17.17  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))1   -0.848      0.152   -5.56  4.4e-08
## factor(a, levels = c(4, 1, 2, 3, 5))2    1.925      0.159   12.13  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))3    4.034      0.155   25.97  < 2e-16
## factor(a, levels = c(4, 1, 2, 3, 5))5    8.098      0.149   54.45  < 2e-16
##                                          
## (Intercept)                           ***
## factor(a, levels = c(4, 1, 2, 3, 5))1 ***
## factor(a, levels = c(4, 1, 2, 3, 5))2 ***
## factor(a, levels = c(4, 1, 2, 3, 5))3 ***
## factor(a, levels = c(4, 1, 2, 3, 5))5 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.909,  Adjusted R-squared:  0.908 
## F-statistic: 1.23e+03 on 4 and 495 DF,  p-value: <2e-16
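
If you prefer not to spell out the full level ordering, the relevel function offers a shortcut for changing the baseline (a sketch equivalent in substance to the a==5 baseline model above, though the remaining dummies keep their original order):

summary(lm(b ~ relevel(factor(a), ref = "5")))$coef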

Another approach is to model the regression without an intercept:

# a==1 as baseline with no intercept:
n3 <- lm(b ~ 0 + factor(a))
summary(n3)
## 
## Call:
## lm(formula = b ~ 0 + factor(a))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9332 -0.6578 -0.0772  0.7627  2.9459 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## factor(a)1   1.0631     0.1042    10.2   <2e-16 ***
## factor(a)2   3.8354     0.1131    33.9   <2e-16 ***
## factor(a)3   5.9445     0.1084    54.9   <2e-16 ***
## factor(a)4   1.9109     0.1113    17.2   <2e-16 ***
## factor(a)5  10.0087     0.0987   101.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.07 on 495 degrees of freedom
## Multiple R-squared:  0.968,  Adjusted R-squared:  0.967 
## F-statistic: 2.97e+03 on 5 and 495 DF,  p-value: <2e-16

In this model, the coefficients are exactly the conditional means of b for each value of a:

coef(n3)  # coefficients
## factor(a)1 factor(a)2 factor(a)3 factor(a)4 factor(a)5 
##      1.063      3.835      5.944      1.911     10.009
sapply(1:5, function(x) mean(b[a == x]))  # conditional means
## [1]  1.063  3.835  5.944  1.911 10.009

All of these models produce the same substantive inference, but might simplify interpretation in any particular situation.
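
One quick way to confirm that these parameterizations differ only in presentation is to check that they imply identical fitted values (a small check using the models estimated above):

all.equal(fitted(n2), fitted(n3))  # with vs. without an intercept
all.equal(fitted(n2), fitted(lm(b ~ factor(a, levels = 5:1))))  # different baseline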

Variable transformations

Sometimes, rather than forcing the categorical variable to be a set of indicators through the use of factor, we can treat the covariate as continuous once we transform it or the outcome in some way. Let's start with some fake data (based on our previous example):

c <- a^3
d <- 2 * a + rnorm(length(a))

These data have a curvilinear relationship that is not well represented by a linear regression line:

plot(c, d, col = "gray")
sapply(unique(c), function(x) points(x, mean(d[c == x]), col = "red", pch = 15))
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
abline(lm(d ~ c))


As before, we can model this by treating the covariate c as a factor, and we find that the model gives us the conditional means of d:

coef(lm(d ~ 0 + factor(c)))  # coefficients
##   factor(c)1   factor(c)8  factor(c)27  factor(c)64 factor(c)125 
##        1.906        3.988        6.078        7.917       10.188
sapply(sort(unique(c)), function(x) mean(d[c == x]))  # conditional means
## [1]  1.906  3.988  6.078  7.917 10.188

We can also obtain the same substantive inference by transforming the variable(s) to produce a linear fit. In this case, we know (because we made up the data) that there is a cubic relationship between c and d. If we make a new version of the covariate c that is the cube-root of c, we should be able to force a linear fit:

c2 <- c^(1/3)
plot(c2, d, col = "gray")
sapply(unique(c2), function(x) points(x, mean(d[c2 == x]), col = "red", pch = 15))
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
abline(lm(d ~ c2))


We could also transform the outcome d by cubing it:

d2 <- d^3
plot(c, d2, col = "gray")
sapply(unique(c), function(x) points(x, mean(d2[c == x]), col = "red", pch = 15))
## [[1]]
## NULL
## 
## [[2]]
## NULL
## 
## [[3]]
## NULL
## 
## [[4]]
## NULL
## 
## [[5]]
## NULL
abline(lm(d2 ~ c))


Again, the plot shows this transformation also produces a linear fit. Thus we can reasonably model the relationship between a discrete covariate and a continuous outcome in a number of ways that satisfy the basic assumption of drawing the regression line through the conditional means of the outcome.

Character String Manipulation

Unlike other statistical packages, R has a robust and simple-to-use set of string manipulation functions. These functions become useful in a number of situations, including: dynamically creating variables, generating tabular and graphical output, reading and writing from text files and the web, and managing character data (e.g., recoding free response or other character data). This tutorial walks through some of the basic string manipulation functions.

paste

The simplest and most important string manipulation function is paste. It allows the user to concatenate character strings (and vectors of character strings) in a number of different ways. The easiest way to use paste is simply to concatenate several values together:

paste("1", "2", "3")
## [1] "1 2 3"

The result is a single string (i.e., one-element character vector) with the numbers separated by spaces (which is the default). We can also separate by other values:

paste("1", "2", "3", sep = ",")
## [1] "1,2,3"

A helpful feature of paste is that it coerces objects to character before concatenating, so we can get the same result above using:

paste(1, 2, 3, sep = ",")
## [1] "1,2,3"

This also means we can combine objects of different classes (e.g., character and numeric):

paste("a", 1, "b", 2, sep = ":")
## [1] "a:1:b:2"

Another helpful feature of paste is that it is vectorized, meaning that we can concatenate each element of two or more vectors in a single call:

a <- letters[1:10]
b <- 1:10
paste(a, b, sep = "")
##  [1] "a1"  "b2"  "c3"  "d4"  "e5"  "f6"  "g7"  "h8"  "i9"  "j10"

The result is a 10-element vector, where the first element of a has been pasted to the first element of b and so forth. We might want to collapse a multi-item vector into a single string and for this we can use the collapse argument to paste:

paste(a, collapse = "")
## [1] "abcdefghij"

Here, all of the elements of a are concatenated into a single string. We can also combine the sep and collapse arguments to obtain different results:

paste(a, b, sep = "", collapse = ",")
## [1] "a1,b2,c3,d4,e5,f6,g7,h8,i9,j10"
paste(a, b, sep = ",", collapse = ";")
## [1] "a,1;b,2;c,3;d,4;e,5;f,6;g,7;h,8;i,9;j,10"

The first result above concatenates corresponding elements from each vector without a space and then separates them by a comma. The second result concatenates corresponding elements with a comma between the elements and separates each pair of elements by semicolon.

strsplit

The strsplit function offers essentially the reversal of paste, by cutting a string into parts based on a separator. Here we can collapse our a vector and then split it back into a vector:

a1 <- paste(a, collapse = ",")
a1
## [1] "a,b,c,d,e,f,g,h,i,j"
strsplit(a1, ",")
## [[1]]
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

Note: strsplit returns a list of results, so accessing the elements requires using the [[]] (double bracket) operator. To get the second element from the split vector, use:

strsplit(a1, ",")[[1]][2]
## [1] "b"

The reason for this return value is that strsplit is also vectorized. So we can split multiple elements of a character vector in one call:

b1 <- paste(a, b, sep = ",")
b1
##  [1] "a,1"  "b,2"  "c,3"  "d,4"  "e,5"  "f,6"  "g,7"  "h,8"  "i,9"  "j,10"
strsplit(b1, ",")
## [[1]]
## [1] "a" "1"
## 
## [[2]]
## [1] "b" "2"
## 
## [[3]]
## [1] "c" "3"
## 
## [[4]]
## [1] "d" "4"
## 
## [[5]]
## [1] "e" "5"
## 
## [[6]]
## [1] "f" "6"
## 
## [[7]]
## [1] "g" "7"
## 
## [[8]]
## [1] "h" "8"
## 
## [[9]]
## [1] "i" "9"
## 
## [[10]]
## [1] "j"  "10"

The result is a list of split vectors.

Sometimes we want to get every single character from a character string, and for this we can use an empty separator:

strsplit(a1, "")[[1]]
##  [1] "a" "," "b" "," "c" "," "d" "," "e" "," "f" "," "g" "," "h" "," "i"
## [18] "," "j"

The result is every letter and every separator split apart. strsplit also supports much more advanced character splitting using “regular expressions.” We address that in a separate tutorial.
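
As a quick taste (the details are left to that tutorial), the split argument is interpreted as a pattern rather than a literal string, so we can, for example, split a made-up string on any run of digits:

strsplit("a1b22c333d", "[0-9]+")[[1]]
## [1] "a" "b" "c" "d"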

nchar and substring

Sometimes we want to know how many characters are in a string, or just get a subset of the characters. R provides two functions to help us with these operations: nchar and substring. You can think of nchar as analogous to length but instead of telling you how many elements are in a vector it tells you how many characters are in a string:

d <- "abcd"
length(d)
## [1] 1
nchar(d)
## [1] 4

nchar is vectorized, which means we can retrieve the number of characters in each element of a character vector in one call:

e <- c("abc", "defg", "hi", "jklmnop")
nchar(e)
## [1] 3 4 2 7

substring lets you extract a part of a string based on the position of characters in the string and can be combined with nchar:

f <- "hello"
substring(f, 1, 1)
## [1] "h"
substring(f, 2, nchar(f))
## [1] "ello"

substring is also vectorized. For example we could extract the first character from each element of a vector:

substring(e, 1, 1)
## [1] "a" "d" "h" "j"

Or even the last character of elements with different numbers of characters:

e
## [1] "abc"     "defg"    "hi"      "jklmnop"
nchar(e)
## [1] 3 4 2 7
substring(e, nchar(e), nchar(e))
## [1] "c" "g" "i" "p"

Colors for Plotting

The difference between a simple graph and a visually stunning graph is of course a matter of many features. But one of the biggest contributors to the “wow” factor that often accompanies R graphics is the careful use of color. By default, R graphs tend to be black-and-white and, in fact, rather unattractive. But R provides many functions for carefully controlling the colors that are used in plots. This tutorial looks at some of these functions.

To start, we need to have a baseline graph. We'll use a simple scatterplot. Let's start with some x and y data vectors and a z grouping factor that we'll use later:

set.seed(100)
z <- sample(1:4, 100, TRUE)
x <- rnorm(100)
y <- rnorm(100)

Let's draw the basic scatterplot:

plot(x, y, pch = 15)


By default, the points in this plot are black. But we can change that color by specifying a col argument and a character string containing a color. For example, we could make the points red:

plot(x, y, pch = 15, col = "red")


or blue:

plot(x, y, pch = 15, col = "blue")


R comes with hundreds of colors, which we can see using the colors() function. Let's look at the first 25 of these colors:

colors()[1:25]
##  [1] "white"          "aliceblue"      "antiquewhite"   "antiquewhite1" 
##  [5] "antiquewhite2"  "antiquewhite3"  "antiquewhite4"  "aquamarine"    
##  [9] "aquamarine1"    "aquamarine2"    "aquamarine3"    "aquamarine4"   
## [13] "azure"          "azure1"         "azure2"         "azure3"        
## [17] "azure4"         "beige"          "bisque"         "bisque1"       
## [21] "bisque2"        "bisque3"        "bisque4"        "black"         
## [25] "blanchedalmond"

You can specify any of these colors as is.

Color Vector Recycling

An important aspect of R's use of the col argument is the notion of vector recycling. R expects the col argument to have the same length as the number of things it is plotting (in this case the number of points). So when we specify col='red', R actually “recycles” the color red for each point, effectively constructing a vector like c('red','red','red',...) equal to the length of our data. We can take advantage of recycling to specify multiple colors. For example, we can specify every other point in our data as being red and blue:

plot(x, y, pch = 15, col = c("red", "blue"))

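To make the recycling explicit (a small sketch): R behaves as if the short color vector were repeated out to the length of the data, which we can mimic with rep:

rep(c("red", "blue"), length.out = 6)  # "red" "blue" "red" "blue" "red" "blue"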

Of course, these colors are not substantively meaningful. Our data are not organized in an alternating fashion. We did, however, have a grouping factor z that takes four levels. We can imagine that these are four substantively important groups in our data that we would like to highlight with different colors. To do that, we could specify a vector of four colors and index it using our z vector:

plot(x, y, pch = 15, col = c("red", "blue", "green", "orange")[z])


Now, the four groups each have their own color in the resulting plot. Another strategy is to use the pch (“point character”) argument to identify groups, which we can do using the same logic:

plot(x, y, pch = c(15, 16, 17, 18)[z])


But I think colors look better here than different shapes. Of course, sometimes we have to print in grayscale or monochrome, so finding the best combination of shapes and colors may take a bit of work.

Color generation functions

In addition to the named colors, R can also generate any other color in the rainbow using one of several functions. For example, the rgb function can generate a color based on levels of Red, Green, and Blue (thus the rgb name). The color red is simply:

rgb(1, 0, 0)
## [1] "#FF0000"

The result is the color red expressed in hexadecimal format. Two other functions - hsv and hcl - let you specify colors in other ways, but rgb is the easiest, in part because hexadecimal format is widely used in web publishing, so there are many tools online for figuring out how to create the color you want as a combination of red, green, and blue. We can see that specifying col='red' or col=rgb(1,0,0) produces the same graphical result:

plot(x, y, pch = 15, col = "red")


plot(x, y, pch = 15, col = rgb(1, 0, 0))


But rgb (and the other color-generation functions) are also “vectorized”, meaning that we can supply them with a vector of numbers in order to obtain different shades. For example, to get four shades of red, we can type:

rgb((1:4)/4, 0, 0)
## [1] "#400000" "#800000" "#BF0000" "#FF0000"

If we index this with z (as we did above), we get a plot where our different groups are represented by different shades of red:

plot(x, y, pch = 15, col = rgb((1:4)/4, 0, 0)[z])


When we have to print in grayscale, R also supplies a function for building shades of gray, which is called - unsurprisingly - gray. The gray function takes a number between 0 and 1 that specifies a shade of gray between black (0) and white (1):

gray(0.5)
## [1] "#808080"

The response is, again, a hexadecimal color representation. Like rgb, gray is vectorized and we can use it to color our plot:

gray((1:4)/6)
## [1] "#2B2B2B" "#555555" "#808080" "#AAAAAA"
plot(x, y, pch = 15, col = gray((1:4)/6)[z])


But R doesn't restrict us to one color palette - just one color or just grayscale. We can also produce “rainbows” of color. For example, we could use the rainbow function to get a rainbow of four different colors and use it on our plot.

plot(x, y, pch = 15, col = rainbow(4)[z])


rainbow takes additional arguments, such as start and end, that specify where on the rainbow (as measured from 0 to 1) the colors should come from. So, specifying low values for start and end will make a red/yellow-ish plot, middling values will produce a green/blue-ish plot, and high values will produce a blue/purple-ish plot:

plot(x, y, pch = 15, col = rainbow(4, start = 0, end = 0.25)[z])


plot(x, y, pch = 15, col = rainbow(4, start = 0.35, end = 0.6)[z])


plot(x, y, pch = 15, col = rainbow(4, start = 0.7, end = 0.9)[z])


Color as data

Above we've used color to convey groups within the data. But we can also use color to convey a third variable on our two-dimensional plot. For example, we can imagine that we have some outcome val to which x and y each contribute. We want to see the level of val as it is affected by both x and y. Let's start by creating the val vector as a function of x and y and then use it as a color value:

val <- x + y

Then let's rescale val to be between 0 and 1 to make it easier to use in our color functions:

valcol <- (val + abs(min(val)))/max(val + abs(min(val)))
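
Because val has a negative minimum here, this is just the familiar (x - min)/(max - min) rescaling written a slightly different way. A quick check (valcol2 is a throwaway object created only for the comparison):

valcol2 <- (val - min(val))/(max(val) - min(val))
all.equal(valcol, valcol2)  # TRUE whenever min(val) is negative, as it is here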

Now we can use the valcol vector to color our plot using gray:

plot(x, y, pch = 15, col = gray(valcol))


We could also use rgb to create a spectrum of blues:

plot(x, y, pch = 15, col = rgb(0, 0, valcol))


There are endless other options, but this conveys the basic principles of plot coloring, which rely on named colors or a color-generation function, and on the general R principles of recycling and vectorization.

Comments

Commenting is a way to describe the contents of an R script. Commenting is very important for reproducibility because it helps make sense of code to others and to a future you.

Hash comments

The scripts used in these materials include comments. Any text that follows a hash symbol (#) becomes an R comment. Anything can be in a comment. It is ignored by R. You can comment an entire line or just the end of a line, like:

2 + 2  # This is a comment. The code before the `#` is still evaluated by R.
## [1] 4

Some languages provide multi-line comments. R doesn't have these. Every line has to be commented individually. Most script editors provide the ability to comment out multiple lines at once. This can be helpful if you change your mind about some code:

a <- 1:10

# b <- 1:10

b <- 10:1

In the above example, we comment out the line we don't want to run.

Ignoring Code blocks

If there are large blocks of valid R code that you decide you don't want to run, you can wrap them in an if statement:

a <- 10
b <- 10
c <- 10
if (FALSE) {
    a <- 1
    b <- 2
    c <- 3
}
a
## [1] 10
b
## [1] 10
c
## [1] 10

The lines inside the if(FALSE){...} block are not run. If you decide you want to run them after all, you can just change FALSE to TRUE.
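
A slightly more readable variant of the same trick (just a sketch; run_extra is an arbitrary name) is to store the switch in a named flag at the top of the script, so the intent is obvious:

run_extra <- FALSE  # flip to TRUE to execute the block below
if (run_extra) {
    a <- 1
    b <- 2
    c <- 3
}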

R comment function

R also provides a quite useful function called comment that stores a hidden description of an object. This can be useful in interactive sessions for keeping track of a large number of objects. It also has other uses in modelling and plotting that are discussed elsewhere. To add a comment to an object, we simply assign something to the object and then assign it a comment:

d <- 1:10
d
##  [1]  1  2  3  4  5  6  7  8  9 10
comment(d) <- "This is my first vector"
d
##  [1]  1  2  3  4  5  6  7  8  9 10

Adding a comment is similar to adding a names attribute to an object, but the comment is not printed when we call d. To see a comment for an object, we need to use the comment function again:

comment(d)
## [1] "This is my first vector"

If an object has no comment, we receive a NULL result:

e <- 1:5
comment(e)
## NULL

Note: Comments must be valid character vectors. It is not possible to store a numeric value as a comment, but one can have multiple comments:

comment(e) <- c("hi", "bye")
comment(e)
## [1] "hi"  "bye"

And this means that they can be indexed:

comment(e)[2]
## [1] "bye"

And that we can add additional comments:

comment(e)[length(comment(e)) + 1] <- "hello again"
comment(e)
## [1] "hi"          "bye"         "hello again"

Because comments are not printed by default, it is easy to forget about them. But they can be quite useful.

Correlation and partial correlation

Correlation

The correlation coefficient speaks to the degree to which the relationship between two variables can be summarized by a straight line.

set.seed(1)
n <- 1000
x1 <- rnorm(n, -1, 10)
x2 <- rnorm(n, 3, 2)
y <- 5 * x1 + x2 + rnorm(n, 1, 2)

To obtain the correlation of two variables, we simply supply them as the two arguments to cor:

cor(x1, x2)
## [1] 0.006401

If we want to test the significance of the correlation, we need to use the cor.test function:

cor.test(x1, x2)
## 
##  Pearson's product-moment correlation
## 
## data:  x1 and x2
## t = 0.2022, df = 998, p-value = 0.8398
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.05561  0.06837
## sample estimates:
##      cor 
## 0.006401

To obtain a correlation matrix, we have to supply an input matrix:

cor(cbind(x1, x2, y))
##          x1       x2       y
## x1 1.000000 0.006401 0.99838
## x2 0.006401 1.000000 0.04731
## y  0.998376 0.047312 1.00000

Correlations of non-linear relationships

a <- rnorm(n)
b <- a^2 + rnorm(n)

If we plot the relationship of b on a, we see a strong (non-linear) relationship:

plot(b ~ a)

plot of chunk unnamed-chunk-6

Yet the correlation between the two variables is low:

cor(a, b)
## [1] -0.01712
cor.test(a, b)
## 
##  Pearson's product-moment correlation
## 
## data:  a and b
## t = -0.541, df = 998, p-value = 0.5886
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.07903  0.04492
## sample estimates:
##      cor 
## -0.01712

If we can identify the functional form of the relationship, however, we can figure out what the relationship really is. Clearly a linear relationship is inappropriate:

plot(b ~ a, col = "gray")
curve((x), col = "red", add = TRUE)

plot of chunk unnamed-chunk-8

But what about b ~ a^2 (of course, we know this is the correct form):

plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)

plot of chunk unnamed-chunk-9

The correlation between b and a^2 is thus much higher:

cor(a^2, b)
## [1] 0.843

We can see this visually by plotting b against the transformed a variable:

plot(b ~ I(a^2), col = "gray")

plot of chunk unnamed-chunk-11

If we now overlay a linear relationship, we see how well the transformed data are represented by a line:

plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)

plot of chunk unnamed-chunk-12

Now let's view the two plots side-by-side to see how the transformation works:

layout(matrix(1:2, nrow = 1))
plot(b ~ a, col = "gray")
curve((x^2), col = "blue", add = TRUE)
plot(b ~ I(a^2), col = "gray")
curve((x), col = "blue", add = TRUE)

plot of chunk unnamed-chunk-13

Partial correlations

An approach that is sometimes used to examine the effects of variables involves “partial correlations.” A partial correlation measures the strength of the linear relationship between two variables, controlling for the influence of one or more covariates. For example, the correlation of y and z is:

z <- x1 + rnorm(n, 0, 2)
cor(y, z)
## [1] 0.9813

This correlation might be inflated or deflated due to the common antecedent variable x1 underlying both y and z. Thus we may want to remove the variation due to x1 from both y and z via linear regression:

part1 <- lm(y ~ x1)
part2 <- lm(z ~ x1)

The correlation of the residuals of those two models is thus the partial correlation:

cor(part1$residual, part2$residual)
## [1] 0.03828

As we can see, the correlation between these variables is actually much lower once we account for the variation attributable to x1.
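
As a cross-check (a sketch not in the original tutorial), the same partial correlation can be computed directly from the three pairwise correlations using the standard first-order formula:

r_yz <- cor(y, z)
r_yx <- cor(y, x1)
r_zx <- cor(z, x1)
(r_yz - r_yx * r_zx)/sqrt((1 - r_yx^2) * (1 - r_zx^2))  # matches the residual-based value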

Count Regression Models

We sometimes want to estimate models of count outcomes. Depending on substantive assumptions, we can model these using a linear model, an ordered outcome model, or a count-specific model. This tutorial talks about count models, specifically Poisson models and negative binomial models. Poisson models can be estimated using R's base glm function, but negative binomial regression requires the MASS add-on package, which is a recommended package and therefore comes pre-installed; you simply need to load it.

# poisson(link = 'log')

library(MASS)
# glm.nb()
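
As a minimal sketch (using simulated data that is not part of the original tutorial), the two models can be fit as follows:

# simulate an overdispersed count outcome (illustrative only)
set.seed(1)
x <- rnorm(200)
counts <- rnbinom(200, mu = exp(0.5 + 0.8 * x), size = 1.5)

# Poisson regression with base R's glm
m_pois <- glm(counts ~ x, family = poisson(link = "log"))
summary(m_pois)

# negative binomial regression with MASS
m_nb <- glm.nb(counts ~ x)
summary(m_nb)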

Data Archiving

As part of reproducible research, it is critical to make data and replication files publicly available. Within political science, The Dataverse Network is increasingly seen as the disciplinary standard for where and how to permanently archive data and replication files. This tutorial works through how to archive study data, files, and metadata at The Dataverse Network directly through R.

The Dataverse Network

The Dataverse Network, created by the Institute for Quantitative Social Science at Harvard University, is software and an associated network of websites that permanently archive social science data for posterity. The service is free to use, relatively simple, and strengthened by a recently added Application Programming Interface (API) that allows researchers to deposit into the Dataverse directly from R through the dvn package.

The dvn package

To deposit data in the Dataverse, you need to have an account. You can pick which dataverse website you want to use, but I recommend using the Harvard Dataverse, where much political science data is stored. Once you create an account and configure a personal dataverse, you can do almost everything else directly in R. To get started, install and load the dvn package:

install.packages("dvn", repos = "http://cran.r-project.org")
## Warning: package 'dvn' is in use and will not be installed
library(dvn)

Once installed, you'll need to set up your username and password using:

options(dvn.user = "username", dvn.pwd = "password")

Remember not to share your username and password with others. Because the remainder of this tutorial only works with a valid username and password, the following code is commented out, but it should run on your machine. You can check that your login credentials work by retrieving your Service Document:

# dvServiceDoc()

If that succeeds, then you can easily create a study by setting up some metadata (e.g., the title, author, etc. for your study) and then using dvCreateStudy to create the study listing.

# writeLines(dvBuildMetadata(title='My Study'), 'mystudy.xml')
# created <- dvCreateStudy('mydataverse', 'mystudy.xml')

Then, you need to add files. dvn is versatile with regard to how to do this, allowing you to submit either filenames as character strings:

# dvAddFile(created$objectId,filename=c('file1.csv','file2.txt'))

dataframes currently loaded in memory:

# dvAddFile(created$objectId,dataframe=mydf)

or a .zip file containing multiple files:

# dvAddFile(created$objectId,filename='files.zip')

You can then confirm that everything has been uploaded successfully by examining the Study Statement:

# dvStudyStatement(created$objectId)

If everything looks good, you can then release the study publicly:

# dvReleaseStudy(created$objectId)

The dvn package also allows you to modify the metadata and delete files, but the above constitutes a complete workflow to making your data publicly available. See the package documentation for more details.

Searching for data using dvn

dvn additionally allows you to search for study data directly from R. For example, you can find all of my publicly available data using:

search <- dvSearch(list(authorName = "leeper"))
## 6 search results returned

Thus archiving your data on The Dataverse Network makes it readily accessible to R users everywhere, forever.

Dataframe rearrangement

In addition to knowing how to index and view dataframes, as is discussed in other tutorials, it is also helpful to be able to adjust the arrangement of dataframes. By this I mean that it is sometimes helpful to split, sample, reorder, reshape, or otherwise change the organization of a dataframe. This tutorial explains a couple of functions that can help with these kinds of tasks. Note: One of the most important things to remember about R dataframes is that it rarely, if ever, matters what order observations or variables are in within a dataframe. Whereas in SPSS and SAS observations have to be sorted before performing operations, R does not require such sorting.

Column order

Sometimes we want to get dataframe columns in a different order from how they were read in. In most cases, though, we can just index the dataframe to see the relevant columns rather than reordering them, but we can do the reordering if we want. Say we have the following 5-column dataframe:

set.seed(50)
mydf <- data.frame(a = rep(1:2, each = 10), b = rep(1:4, times = 5), c = rnorm(20), 
    d = rnorm(20), e = sample(1:20, 20, FALSE))
head(mydf)
##   a b       c       d  e
## 1 1 1  0.5497 -0.3499 11
## 2 1 2 -0.8416 -0.5869  1
## 3 1 3  0.0330 -1.5899  7
## 4 1 4  0.5241  1.6896  8
## 5 1 1 -1.7276  0.5636  9
## 6 1 2 -0.2779  2.6676 19

To view the columns in a different order, we can simply index the dataframe differently either by name or column position:

head(mydf[, c(3, 4, 5, 1, 2)])
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2
head(mydf[, c("c", "d", "e", "a", "b")])
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2

We can save the adjusted column order if we want:

mydf <- mydf[, c(3, 4, 5, 1, 2)]
head(mydf)
##         c       d  e a b
## 1  0.5497 -0.3499 11 1 1
## 2 -0.8416 -0.5869  1 1 2
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 5 -1.7276  0.5636  9 1 1
## 6 -0.2779  2.6676 19 1 2

Row order

Changing row order works the same way as changing column order: we simply index the dataframe in a different way. For example, if we want to reverse the order of the dataframe, we can simply write:

mydf[nrow(mydf):1, ]
##          c        d  e a b
## 20 -0.3234  0.39322  5 2 4
## 19 -1.1660  0.40619  4 2 3
## 18 -0.7653  1.83968  3 2 2
## 17 -0.1568  0.41620 20 2 1
## 16 -0.3629  0.01910  2 2 4
## 15 -0.4555 -1.09605 14 2 3
## 14  0.1957  0.59725  6 2 2
## 13 -0.4986 -1.13045 13 2 1
## 12  0.5548 -0.85142 10 2 4
## 11  0.2952  0.19902 12 2 3
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 8  -0.5909 -0.36212 16 1 4
## 7   0.3608  0.35653 15 1 3
## 6  -0.2779  2.66763 19 1 2
## 5  -1.7276  0.56358  9 1 1
## 4   0.5241  1.68956  8 1 4
## 3   0.0330 -1.58988  7 1 3
## 2  -0.8416 -0.58690  1 1 2
## 1   0.5497 -0.34993 11 1 1

And then we can save this new order if we want:

mydf <- mydf[nrow(mydf):1, ]

Rarely, however, do we want to just reorder by hand. Instead we might want to reorder according to the values of a column. One's intuition might be to use the sort function because it is used to sort a vector:

mydf$e
##  [1]  5  4  3 20  2 14  6 13 10 12 17 18 16 15 19  9  8  7  1 11
sort(mydf$e)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

But trying to run sort(mydf) will produce an error. Instead, we need to use the order function, which returns the indices of a sorted vector. Confusing? Let's see how it works:

order(mydf$e)
##  [1] 19  5  3  2  1  7 18 17 16  9 20 10  8  6 14 13 11 12 15  4

That doesn't look like a sorted vector, but this is because the values being shown are the indices of the vector, not the values themselves. If we index the mydf$e vector by the output of order(mydf$e), it will be in the order we're expecting:

mydf$e[order(mydf$e)]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

We can apply this same logic to sorting a dataframe. We simply pick which column we want to order by and then use the output of order as a row index. Let's compare the reordered dataframe to the original:

head(mydf[order(mydf$e), ])
##          c       d e a b
## 2  -0.8416 -0.5869 1 1 2
## 16 -0.3629  0.0191 2 2 4
## 18 -0.7653  1.8397 3 2 2
## 19 -1.1660  0.4062 4 2 3
## 20 -0.3234  0.3932 5 2 4
## 14  0.1957  0.5972 6 2 2
head(mydf)  # original
##          c       d  e a b
## 20 -0.3234  0.3932  5 2 4
## 19 -1.1660  0.4062  4 2 3
## 18 -0.7653  1.8397  3 2 2
## 17 -0.1568  0.4162 20 2 1
## 16 -0.3629  0.0191  2 2 4
## 15 -0.4555 -1.0960 14 2 3

Of course, we could save the reordered dataframe just as above:

mydf <- mydf[order(mydf$e), ]
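
order also accepts multiple sort keys, and a numeric key can be negated to sort it in decreasing order. For example (a sketch using the same mydf), to sort by a and then by e in descending order:

head(mydf[order(mydf$a, -mydf$e), ])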

Subset of rows

Another common operation is to look at a subset of dataframe rows. For example, we might want to look at just the rows where mydf$a==1. Remembering the rules for indexing a dataframe, we can simply index according to a logical rule:

mydf[mydf$a == 1, ]
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2

And to get the rows where mydf$a==2, we can do quite the same operation:

mydf[mydf$a == 2, ]
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 12  0.5548 -0.8514 10 2 4
## 11  0.2952  0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568  0.4162 20 2 1

We can also combine logical rules to get a further subset of values:

mydf[mydf$a == 1 & mydf$b == 4, ]
##         c       d  e a b
## 4  0.5241  1.6896  8 1 4
## 8 -0.5909 -0.3621 16 1 4

And we need not restrict ourselves to equality comparisons:

mydf[mydf$a == 1 & mydf$b > 2, ]
##         c       d  e a b
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 7  0.3608  0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4

R also supplies a subset function, which can be used to select subsets of rows, subsets of columns, or both. It works like so:

# subset of rows:
subset(mydf, a == 1)
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
subset(mydf, a == 1 & b > 2)
##         c       d  e a b
## 3  0.0330 -1.5899  7 1 3
## 4  0.5241  1.6896  8 1 4
## 7  0.3608  0.3565 15 1 3
## 8 -0.5909 -0.3621 16 1 4
# subset of columns:
subset(mydf, select = c("a", "b"))
##    a b
## 2  1 2
## 16 2 4
## 18 2 2
## 19 2 3
## 20 2 4
## 14 2 2
## 3  1 3
## 4  1 4
## 5  1 1
## 12 2 4
## 1  1 1
## 11 2 3
## 13 2 1
## 15 2 3
## 7  1 3
## 8  1 4
## 10 1 2
## 9  1 1
## 6  1 2
## 17 2 1
# subset of rows and columns:
subset(mydf, a == 1 & b > 2, select = c("c", "d"))
##         c       d
## 3  0.0330 -1.5899
## 4  0.5241  1.6896
## 7  0.3608  0.3565
## 8 -0.5909 -0.3621

Using indexing and subset is largely equivalent (one difference: subset drops rows where the condition evaluates to NA, whereas logical indexing returns them as all-NA rows), but the indexing syntax is more general.

Splitting a dataframe

In one of the above examples, we extracted two separate dataframes: one for mydf$a==1 and one for mydf$a==2. We can actually achieve that result using a single line of code involving the split function, which returns a list of dataframes, separated out by a grouping factor:

split(mydf, mydf$a)
## $`1`
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 1   0.5497 -0.34993 11 1 1
## 7   0.3608  0.35653 15 1 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
## 
## $`2`
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 12  0.5548 -0.8514 10 2 4
## 11  0.2952  0.1990 12 2 3
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 17 -0.1568  0.4162 20 2 1

We can also split by multiple factors, e.g., a dataframe for every unique combination of mydf$a and mydf$b:

split(mydf, list(mydf$a, mydf$b))
## $`1.1`
##         c       d  e a b
## 5 -1.7276  0.5636  9 1 1
## 1  0.5497 -0.3499 11 1 1
## 9  0.9756  0.5687 18 1 1
## 
## $`2.1`
##          c       d  e a b
## 13 -0.4986 -1.1304 13 2 1
## 17 -0.1568  0.4162 20 2 1
## 
## $`1.2`
##          c        d  e a b
## 2  -0.8416 -0.58690  1 1 2
## 10 -1.4457  0.02867 17 1 2
## 6  -0.2779  2.66763 19 1 2
## 
## $`2.2`
##          c      d e a b
## 18 -0.7653 1.8397 3 2 2
## 14  0.1957 0.5972 6 2 2
## 
## $`1.3`
##        c       d  e a b
## 3 0.0330 -1.5899  7 1 3
## 7 0.3608  0.3565 15 1 3
## 
## $`2.3`
##          c       d  e a b
## 19 -1.1660  0.4062  4 2 3
## 11  0.2952  0.1990 12 2 3
## 15 -0.4555 -1.0960 14 2 3
## 
## $`1.4`
##         c       d  e a b
## 4  0.5241  1.6896  8 1 4
## 8 -0.5909 -0.3621 16 1 4
## 
## $`2.4`
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 20 -0.3234  0.3932  5 2 4
## 12  0.5548 -0.8514 10 2 4

Having our dataframes stored inside another object might seem inconvenient, but it is actually very useful because we can use functions like lapply to perform an operation on every dataframe in the list. For example, we could get the summary of every variable in each of the two subsets of the dataframe in a single line of code:

lapply(split(mydf, mydf$a), summary)
## $`1`
##        c                d                e               a    
##  Min.   :-1.728   Min.   :-1.590   Min.   : 1.00   Min.   :1  
##  1st Qu.:-0.779   1st Qu.:-0.359   1st Qu.: 8.25   1st Qu.:1  
##  Median :-0.122   Median : 0.193   Median :13.00   Median :1  
##  Mean   :-0.244   Mean   : 0.299   Mean   :12.10   Mean   :1  
##  3rd Qu.: 0.483   3rd Qu.: 0.568   3rd Qu.:16.75   3rd Qu.:1  
##  Max.   : 0.976   Max.   : 2.668   Max.   :19.00   Max.   :1  
##        b       
##  Min.   :1.00  
##  1st Qu.:1.25  
##  Median :2.00  
##  Mean   :2.30  
##  3rd Qu.:3.00  
##  Max.   :4.00  
## 
## $`2`
##        c                d                 e               a    
##  Min.   :-1.166   Min.   :-1.1304   Min.   : 2.00   Min.   :2  
##  1st Qu.:-0.488   1st Qu.:-0.6338   1st Qu.: 4.25   1st Qu.:2  
##  Median :-0.343   Median : 0.2961   Median : 8.00   Median :2  
##  Mean   :-0.268   Mean   : 0.0793   Mean   : 8.90   Mean   :2  
##  3rd Qu.: 0.108   3rd Qu.: 0.4137   3rd Qu.:12.75   3rd Qu.:2  
##  Max.   : 0.555   Max.   : 1.8397   Max.   :20.00   Max.   :2  
##        b       
##  Min.   :1.00  
##  1st Qu.:2.00  
##  Median :3.00  
##  Mean   :2.70  
##  3rd Qu.:3.75  
##  Max.   :4.00

Sampling and permutations

Another common task is random sampling or permutation of rows in a dataframe. For example, we might want to build a regression model on a random subset of cases (a “training set”) and then test the model on the remaining cases (a “test set”). Or, we might want to look at a random sample of the observations (e.g., perhaps to speed up a very time-consuming analysis). Let's consider the case of sampling for “training” and “test” sets. To obtain a random sample, we have two choices: we can either sample a specified number of rows or we can use a logical index to sample rows based on a specified probability. Both use the sample function. To look at, e.g., exactly five randomly selected rows from our dataframe as the training set, we can do the following:

s <- sample(1:nrow(mydf), 5, FALSE)
s
## [1] 11 19 12  1 17

Note: The third argument (FALSE) refers to whether sampling should be done with replacement. We can then use that directly as a row index:

mydf[s, ]
##          c        d  e a b
## 1   0.5497 -0.34993 11 1 1
## 6  -0.2779  2.66763 19 1 2
## 11  0.2952  0.19902 12 2 3
## 2  -0.8416 -0.58690  1 1 2
## 10 -1.4457  0.02867 17 1 2

To see the test set, we simply drop all rows not in s:

mydf[-s, ]
##          c       d  e a b
## 16 -0.3629  0.0191  2 2 4
## 18 -0.7653  1.8397  3 2 2
## 19 -1.1660  0.4062  4 2 3
## 20 -0.3234  0.3932  5 2 4
## 14  0.1957  0.5972  6 2 2
## 3   0.0330 -1.5899  7 1 3
## 4   0.5241  1.6896  8 1 4
## 5  -1.7276  0.5636  9 1 1
## 12  0.5548 -0.8514 10 2 4
## 13 -0.4986 -1.1304 13 2 1
## 15 -0.4555 -1.0960 14 2 3
## 7   0.3608  0.3565 15 1 3
## 8  -0.5909 -0.3621 16 1 4
## 9   0.9756  0.5687 18 1 1
## 17 -0.1568  0.4162 20 2 1

An alternative is to sample each row with some probability (here 20%) rather than requiring a fixed number of observations. To do that, we make 20 random draws (i.e., a number of draws equal to the number of rows in our dataframe) from a binomial distribution with probability .2:

s2 <- rbinom(nrow(mydf), 1, 0.2)
s2
##  [1] 1 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0

Note, however, that s2 is a numeric vector of 0s and 1s rather than a logical vector, so using it directly as a row index does not do what we want: the 0s are ignored and each 1 selects the first row, which is simply repeated:

mydf[s2, ]
##           c       d e a b
## 2   -0.8416 -0.5869 1 1 2
## 2.1 -0.8416 -0.5869 1 1 2
## 2.2 -0.8416 -0.5869 1 1 2
## 2.3 -0.8416 -0.5869 1 1 2
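
To select the intended sample, we first convert s2 to a logical vector and then use it as a row index:

mydf[as.logical(s2), ]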

And we can again see the test set as the observations not selected by s2:

mydf[!s2, ]
##          c        d  e a b
## 16 -0.3629  0.01910  2 2 4
## 18 -0.7653  1.83968  3 2 2
## 20 -0.3234  0.39322  5 2 4
## 3   0.0330 -1.58988  7 1 3
## 4   0.5241  1.68956  8 1 4
## 5  -1.7276  0.56358  9 1 1
## 12  0.5548 -0.85142 10 2 4
## 1   0.5497 -0.34993 11 1 1
## 11  0.2952  0.19902 12 2 3
## 13 -0.4986 -1.13045 13 2 1
## 15 -0.4555 -1.09605 14 2 3
## 8  -0.5909 -0.36212 16 1 4
## 10 -1.4457  0.02867 17 1 2
## 9   0.9756  0.56875 18 1 1
## 6  -0.2779  2.66763 19 1 2
## 17 -0.1568  0.41620 20 2 1

Note: Here we use !s2 because ! coerces the 0/1 vector s2 to a logical index (TRUE where s2 is 0), whereas above we used -s because s was a positional index.
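
Finally, as the section title suggests, a random permutation of the rows is just a sample of every row position without replacement (a short sketch not in the original):

mydf[sample(nrow(mydf)), ]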

Dataframe Structure

Dataframes are integrally important to using R for any kind of data analysis. One of the most frustrating aspects of R for new users is that, unlike Excel, or even SPSS or Stata, it is not terribly easy to look at and modify data in a spreadsheet-like format. In the tutorial on dataframes as a class, you should have learned a bit about what dataframes are and how to index and modify them. Here we are going to discuss how to look at dataframes in a variety of ways.

print, summary, and str

Looking at dataframes in R is actually pretty easy. Because a dataframe is an R object, we can simply print it to the console by calling its name. Let's create a dataframe and try this:

mydf <- data.frame(a = rbinom(100, 1, 0.5), b = rnorm(100), c = rnorm(100), 
    d = rnorm(100), e = sample(LETTERS, 100, TRUE))
mydf
##     a        b         c        d e
## 1   0 -0.65302  1.617287  0.35947 O
## 2   0 -1.56067  1.374269  1.16150 U
## 3   0 -0.88265  0.561109  1.50842 Y
## 4   0 -0.64753  1.414186 -1.33762 N
## 5   0 -0.94923 -0.964017 -2.29660 E
## 6   0  1.12688 -0.616431 -1.96846 F
## 7   0  1.72761  0.008532 -0.73825 W
## 8   1 -0.29763  1.572682 -0.19632 T
## 9   0 -0.24442  0.053971  2.59850 Z
## 10  0 -0.84921 -0.189399 -1.13353 K
## 11  0  0.11510 -0.043527 -1.89618 B
## 12  1  0.70786  0.024526 -1.08325 S
## 13  0 -0.92021 -3.408887 -0.70295 E
## 14  0  1.13397 -0.029900  0.55542 V
## 15  0  0.04453  0.373467 -0.61795 S
## 16  1  1.47634  0.944661 -0.36271 Z
## 17  0  1.62780 -0.603154 -0.07608 J
## 18  0  0.78341 -0.591424  0.36601 Q
## 19  0 -0.07220 -1.497778  0.70145 Y
## 20  0 -1.32925  1.888501  1.05821 J
## 21  0  1.08259 -2.293813  0.49702 C
## 22  1  0.73256 -0.552174 -0.72288 B
## 23  1 -0.30210  0.576488 -0.94125 R
## 24  1 -0.39293 -0.186210  0.82022 J
## 25  0 -2.64254  1.022059 -1.40601 T
## 26  0 -0.22410  1.673398 -2.00373 Z
## 27  0  1.95346 -1.285846  1.67366 V
## 28  0 -0.58287  0.930812 -1.99689 P
## 29  1  1.06114  0.512845 -0.96299 N
## 30  1  0.75882 -0.544033 -0.87342 Z
## 31  0  0.58825 -0.537684  0.27048 H
## 32  0 -0.43292  0.762805 -0.18115 V
## 33  0 -0.09822  0.144783 -1.51522 O
## 34  1  1.38690  0.202230 -0.92736 S
## 35  0 -1.31040 -1.456198  2.06663 W
## 36  0 -0.67905  1.053445  0.11093 G
## 37  0  1.20022 -0.397525  0.10330 Q
## 38  1  0.99828  0.810732 -0.43627 I
## 39  1 -0.55813  0.300040  0.82089 C
## 40  1  0.19107 -0.732265  1.64319 L
## 41  1 -0.93658 -0.803333  0.65210 R
## 42  0  1.71818 -0.259426  1.72735 L
## 43  0  0.79274 -1.577459 -2.33531 Y
## 44  1 -0.17978  0.387909 -0.04763 T
## 45  0 -1.27127 -0.731157 -0.23587 J
## 46  1  0.36220 -1.182620 -1.58457 H
## 47  1  2.26727  1.503092 -1.20872 D
## 48  0 -0.56679 -1.205823  0.30645 Q
## 49  0  1.18184  0.274242 -0.25508 H
## 50  0 -0.43997 -1.203856  0.03733 Z
## 51  0 -0.21525  0.175392  1.54721 T
## 52  0  0.17862  2.041101  0.48442 Y
## 53  1  2.82008  1.209535 -0.67040 X
## 54  0 -0.02909 -0.379774  0.13640 X
## 55  1 -0.52543 -0.976383 -0.44816 W
## 56  1  0.92736 -0.066320 -1.38853 A
## 57  0  0.81235 -1.163808  0.02140 N
## 58  1 -1.63686 -0.670042 -0.55861 P
## 59  1 -1.45887 -0.257498  0.66978 K
## 60  1  0.36716  0.092494 -0.59397 M
## 61  1  0.50476 -1.691161  0.13602 U
## 62  1 -0.53350 -0.781128  0.39872 T
## 63  0  0.13419 -1.218642  0.43340 X
## 64  0  0.68213 -0.262076 -0.57323 U
## 65  0 -2.09181  1.600879  0.16202 L
## 66  1 -1.35759  0.271196 -1.45684 R
## 67  0 -0.64975  0.404372 -0.44506 V
## 68  0 -0.33656 -0.662692  0.20784 R
## 69  1 -1.19379 -1.547217 -1.40629 Y
## 70  1  0.48648 -1.117218 -0.12517 R
## 71  0 -1.03210 -0.369793 -0.74953 X
## 72  0  0.34542  0.494358 -1.19533 Z
## 73  1  0.41408  0.264469 -2.49834 O
## 74  1 -0.20288 -0.076575  0.29039 X
## 75  0 -0.18147  0.019607 -1.31953 K
## 76  0 -0.57495  0.778011 -2.20197 I
## 77  1 -1.69877  0.636596 -0.33592 L
## 78  1 -2.07330  1.766734  2.43636 C
## 79  0  0.29462 -0.991969 -0.66017 B
## 80  1  0.29372 -0.573212  0.46335 C
## 81  0  0.85411 -0.371477 -0.06186 W
## 82  1  0.70678  0.274230  0.14330 K
## 83  0 -0.86584  0.313496 -0.82688 W
## 84  1  0.84311 -1.478058  0.25956 S
## 85  0 -1.11050 -0.501903 -2.30398 H
## 86  1  0.23547  2.010354 -0.88391 R
## 87  1  0.04245 -0.928369 -0.75509 U
## 88  0  1.09768 -1.806275 -0.64789 B
## 89  0 -0.85865  1.339204  0.42920 W
## 90  1  0.49483  1.133309  0.51501 W
## 91  0 -2.17343 -1.207055 -0.43024 D
## 92  0  1.56411  0.560760  1.52356 Y
## 93  1  0.23590 -1.444402 -0.48720 Y
## 94  0 -0.58226 -0.188818 -0.26365 W
## 95  1  0.33818 -0.462813 -0.65003 P
## 96  1 -0.25738  1.953699  1.68336 O
## 97  0  1.15532 -0.168700 -0.48666 S
## 98  0 -0.88605 -0.596704 -0.39284 D
## 99  0  1.03949  0.944495  0.02210 K
## 100 0  0.47307 -0.616859  0.72329 M

This output is fine but kind of inconvenient. It doesn't fit on one screen, we can't modify anything, and - if we had more variables and/or more observations - it would be pretty difficult to do anything useful with it in this way. Note: Calling the dataframe by name is the same as print-ing it, so mydf is the same as print(mydf). As we already know, we can use summary to see a more compact version of the dataframe:

summary(mydf)
##        a             b                 c                d         
##  Min.   :0.0   Min.   :-2.6425   Min.   :-3.409   Min.   :-2.498  
##  1st Qu.:0.0   1st Qu.:-0.6506   1st Qu.:-0.731   1st Qu.:-0.876  
##  Median :0.0   Median : 0.0067   Median :-0.123   Median :-0.259  
##  Mean   :0.4   Mean   : 0.0081   Mean   :-0.072   Mean   :-0.231  
##  3rd Qu.:1.0   3rd Qu.: 0.7650   3rd Qu.: 0.565   3rd Qu.: 0.406  
##  Max.   :1.0   Max.   : 2.8201   Max.   : 2.041   Max.   : 2.599  
##                                                                   
##        e     
##  W      : 8  
##  Y      : 7  
##  R      : 6  
##  Z      : 6  
##  K      : 5  
##  S      : 5  
##  (Other):63

Now, instead of all the data, we see a six-number summary (minimum, first quartile, median, mean, third quartile, and maximum) for each numeric or integer variable and a tabulation of mydf$e, which is a factor variable (you can confirm this with class(mydf$e)). We can also use str to see a different kind of compact summary:

str(mydf)
## 'data.frame':    100 obs. of  5 variables:
##  $ a: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ b: num  -0.653 -1.561 -0.883 -0.648 -0.949 ...
##  $ c: num  1.617 1.374 0.561 1.414 -0.964 ...
##  $ d: num  0.359 1.161 1.508 -1.338 -2.297 ...
##  $ e: Factor w/ 26 levels "A","B","C","D",..: 15 21 25 14 5 6 23 20 26 11 ...

This output has the advantage of additionally showing variable classes and the first few values of each variable, but doesn't provide a numeric summary of the data. Thus summary and str complement each other rather than provide duplicate information. Remember, too, that dataframes also carry a “names” attribute, so we can see just the names of our variables using:

names(mydf)
## [1] "a" "b" "c" "d" "e"

This is very important for when a dataframe is very wide (i.e., has large numbers of variables) because even the compact output of summary and str can become unwieldy with more than 20 or so variables.

head and tail

Two frequently neglected functions in R are head and tail. These offer exactly what their names suggest, the top and bottom few values of an object:

head(mydf)
##   a       b       c       d e
## 1 0 -0.6530  1.6173  0.3595 O
## 2 0 -1.5607  1.3743  1.1615 U
## 3 0 -0.8826  0.5611  1.5084 Y
## 4 0 -0.6475  1.4142 -1.3376 N
## 5 0 -0.9492 -0.9640 -2.2966 E
## 6 0  1.1269 -0.6164 -1.9685 F

Note the similarity between these values and those reported in str(mydf).

tail(mydf)
##     a       b       c       d e
## 95  1  0.3382 -0.4628 -0.6500 P
## 96  1 -0.2574  1.9537  1.6834 O
## 97  0  1.1553 -0.1687 -0.4867 S
## 98  0 -0.8861 -0.5967 -0.3928 D
## 99  0  1.0395  0.9445  0.0221 K
## 100 0  0.4731 -0.6169  0.7233 M

Both head and tail accept an additional argument referring to how many values to display:

head(mydf, 2)
##   a      b     c      d e
## 1 0 -0.653 1.617 0.3595 O
## 2 0 -1.561 1.374 1.1615 U
head(mydf, 15)
##    a        b         c       d e
## 1  0 -0.65302  1.617287  0.3595 O
## 2  0 -1.56067  1.374269  1.1615 U
## 3  0 -0.88265  0.561109  1.5084 Y
## 4  0 -0.64753  1.414186 -1.3376 N
## 5  0 -0.94923 -0.964017 -2.2966 E
## 6  0  1.12688 -0.616431 -1.9685 F
## 7  0  1.72761  0.008532 -0.7382 W
## 8  1 -0.29763  1.572682 -0.1963 T
## 9  0 -0.24442  0.053971  2.5985 Z
## 10 0 -0.84921 -0.189399 -1.1335 K
## 11 0  0.11510 -0.043527 -1.8962 B
## 12 1  0.70786  0.024526 -1.0832 S
## 13 0 -0.92021 -3.408887 -0.7029 E
## 14 0  1.13397 -0.029900  0.5554 V
## 15 0  0.04453  0.373467 -0.6180 S

These functions are therefore very helpful for looking quickly at a dataframe. They can also be applied to individual variables inside of a dataframe:

head(mydf$a)
## [1] 0 0 0 0 0 0
tail(mydf$e)
## [1] P O S D K M
## Levels: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

edit and fix

R provides two ways to edit an R dataframe (or matrix) in a spreadsheet-like fashion. They look the same, but are different! Both can be used to look at data in a spreadsheet-like way, but editing with them produces drastically different results. Note: One point of confusion is that calling edit or fix on a non-dataframe object opens a completely different text editing window that can be used to modify vectors, functions, etc. If you try to edit or fix something and don't see a spreadsheet, the object you're trying to edit is not rectangular (i.e., not a dataframe or matrix).

edit

The first of these is edit, which opens an R dataframe as a spreadsheet. The data can then be edited directly. When the spreadsheet window is closed, the resulting dataframe is returned to the user (and printed to the console), but the original mydf object is left unchanged. In other words, when we edit a dataframe, we are actually copying the dataframe, changing values in the copy, and then returning that copy to the console. If we want to use the modified dataframe, we need to save it as a new R object, as shown below.
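
For example (a minimal sketch), to keep the edited version we assign the result of edit to a new name:

mydf2 <- edit(mydf)  # opens the spreadsheet editor; the edited copy is stored in mydf2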

fix

The second data editing function is fix. This is probably the more intuitive function. Like edit, fix opens the spreadsheet editor. But, when the window is closed, the result is used to replace the dataframe. Thus fix(mydf) replaces mydf with the edited data.

edit and fix can seem like a good idea. And if they are used simply to look at data, they're a great additional tool (along with summary, str, head, tail, and indexing). But (!!!!) using edit and fix is a non-reproducible way of conducting data analysis. If we want to replace values in a dataframe, it is better (from the perspective of reproducible science) to write out the code to perform those replacements so that you or someone else can use it in the future to achieve the same results. So, in short, use edit and fix, but don't abuse them.

Dataframes

When it comes to performing statistical analysis in R, the most important object type is a dataframe. When we load data into R or use R to conduct statistical tests or build models, we want to have our data as a dataframe. A dataframe is actually a special type of list that has some properties that facilitate using it for data analysis.

To create a dataframe, we use the data.frame function:

a <- data.frame(1:3)
a
##   X1.3
## 1    1
## 2    2
## 3    3

This example is a single vector coerced into being a dataframe. Our input vector 1:3 is printed as a column and the dataframe has row names:

rownames(a)
## [1] "1" "2" "3"

And the vector has been automatically given a column name:

colnames(a)
## [1] "X1.3"

Note: We can also see the column names of a dataframe using names:

names(a)
## [1] "X1.3"

Like a matrix, we can see that this dataframe has dimensions:

dim(a)
## [1] 3 1

Which we can observe as row and column dimensions:

nrow(a)
## [1] 3
ncol(a)
## [1] 1

But having a dataframe consisting of one column vector isn't very helpful. In general we want to have multiple columns, where each column is a variable and each row is an observation.

b <- data.frame(1:3, 4:6)
b
##   X1.3 X4.6
## 1    1    4
## 2    2    5
## 3    3    6

You can see the similarity to building a list and indeed if we check whether our dataframe is a list, it is:

is.data.frame(b)
## [1] TRUE
is.list(b)
## [1] TRUE

Our new dataframe b now has two column variables and the same number of rows. The names of the dataframe are assigned automatically, but we can change them:

names(b)
## [1] "X1.3" "X4.6"
names(b) <- c("var1", "var2")
names(b)
## [1] "var1" "var2"

We can also assign names when we create a dataframe, just as we did with a list:

d <- data.frame(var1 = 1:3, var2 = 4:6)
names(d)
## [1] "var1" "var2"
d
##   var1 var2
## 1    1    4
## 2    2    5
## 3    3    6

Dataframe indexing

Indexing dataframes works similarly to both lists and matrices. Even though our dataframe isn't a matrix:

is.matrix(d)
## [1] FALSE

We can still index it in two dimensions like a matrix to extract rows, columns, or elements:

d[1, ]  # row
##   var1 var2
## 1    1    4
d[, 2]  # column
## [1] 4 5 6
d[3, 2]  # element
## [1] 6
## [1] 6

Because dataframes are actually lists, we can index them just like we would a list. For example, to get a dataframe containing only our first column variable, we can use single brackets:

d[1]
##   var1
## 1    1
## 2    2
## 3    3

The same result is possible with named indexing:

d["var1"]
##   var1
## 1    1
## 2    2
## 3    3

To get that column variable as a vector instead of a one-column dataframe, we can use double brackets:

d[[1]]
## [1] 1 2 3

And we can also use named indexing as we would in a list:

d[["var1"]]
## [1] 1 2 3
d$var1
## [1] 1 2 3

And, we can combine indexing like we did with a list to get the elements of a column vector:

d[["var2"]][3]
## [1] 6
d$var2[3]
## [1] 6

We can also use - indexing to exclude columns:

d[, -1]
## [1] 4 5 6

or rows:

d[-2, ]
##   var1 var2
## 1    1    4
## 3    3    6

Thus, it is very easy to extract different parts of a dataframe in different ways, depending on what we want to do.

Modifying dataframes

With those indexing rules, it is also very easy to change dataframe elements. For example, to add a column variable, we just need to add a vector with a name:

d$var3 <- 7:9
d
##   var1 var2 var3
## 1    1    4    7
## 2    2    5    8
## 3    3    6    9

If we try to add a vector that is shorter than the number of dataframe rows, recycling is invoked:

d$var4 <- 1
d
##   var1 var2 var3 var4
## 1    1    4    7    1
## 2    2    5    8    1
## 3    3    6    9    1

If we try to add a vector that is longer than the number of dataframe rows, we get an error:

d$var4 <- 1:4
## Error: replacement has 4 rows, data has 3

So even though a dataframe is like a list, it has the restriction that all columns must have the same length.

We can also remove dataframe columns by setting them equal to NULL:

d
##   var1 var2 var3 var4
## 1    1    4    7    1
## 2    2    5    8    1
## 3    3    6    9    1
d$var4 <- NULL
d
##   var1 var2 var3
## 1    1    4    7
## 2    2    5    8
## 3    3    6    9

This permanently removes the column variable from the dataframe and reduces its dimensions. To remove rows, we simply use negative positional indexing as described above and assign the result back to the same name:

d
##   var1 var2 var3
## 1    1    4    7
## 2    2    5    8
## 3    3    6    9
d[-2, ]
##   var1 var2 var3
## 1    1    4    7
## 3    3    6    9
d <- d[-2, ]
d
##   var1 var2 var3
## 1    1    4    7
## 3    3    6    9

This highlights an important point. Unless we assign using <-, we are not modifying the dataframe, only changing what is displayed. If we want to preserve a dataframe and a modified version of it, we can simply assign the modified version a new name:

d
##   var1 var2 var3
## 1    1    4    7
## 3    3    6    9
d2 <- d[, -1]

This leaves our original dataframe unchanged:

d
##   var1 var2 var3
## 1    1    4    7
## 3    3    6    9

And gives us a new object reflecting the modified dataframe:

d2
##   var2 var3
## 1    4    7
## 3    6    9

Combining dataframes

Another similarity between dataframes and matrices is that we can bind them columnwise:

e1 <- data.frame(1:3, 4:6)
e2 <- data.frame(7:9, 10:12)
cbind(e1, e2)
##   X1.3 X4.6 X7.9 X10.12
## 1    1    4    7     10
## 2    2    5    8     11
## 3    3    6    9     12

To bind them rowwise, however, the two dataframes need to have matching names:

names(e1) <- names(e2) <- c("Var1", "Var2")
rbind(e1, e2)
##   Var1 Var2
## 1    1    4
## 2    2    5
## 3    3    6
## 4    7   10
## 5    8   11
## 6    9   12

Dataframes can also be combined using the merge function. merge is powerful, but can also be confusing. Let's imagine that our two dataframes contain observations for the same three individuals, but in different orders:

e1$id <- 1:3
e2$id <- c(2, 1, 3)

We should also rename the variables in e2 to show that these are unique variables:

names(e2)[1:2] <- c("Var3", "Var4")

If we use cbind to combine the data, variables from observations in the two dataframes will be mismatched:

cbind(e1, e2)
##   Var1 Var2 id Var3 Var4 id
## 1    1    4  1    7   10  2
## 2    2    5  2    8   11  1
## 3    3    6  3    9   12  3

This is where merge comes in handy because we can specify a by parameter:

e3 <- merge(e1, e2, by = "id")

The result is a single dataframe with a single id variable, in which the observations from the two dataframes are matched appropriately.
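
We can print e3 to confirm that the rows were matched on id rather than on position:

e3
##   id Var1 Var2 Var3 Var4
## 1  1    1    4    8   11
## 2  2    2    5    7   10
## 3  3    3    6    9   12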

That was a simple example, but what if our dataframes have different (but overlapping) sets of observations?

e4 <- data.frame(Var5 = 10:1, Var6 = c(5:1, 1:5), id = c(1:2, 4:11))
e4
##    Var5 Var6 id
## 1    10    5  1
## 2     9    4  2
## 3     8    3  4
## 4     7    2  5
## 5     6    1  6
## 6     5    1  7
## 7     4    2  8
## 8     3    3  9
## 9     2    4 10
## 10    1    5 11

This new dataframe e4 has two observations common to the previous dataframes (1 and 2) but no observation for 3.

If we merge e3 and e4, what do we get?

merge(e3, e4, by = "id")
##   id Var1 Var2 Var3 Var4 Var5 Var6
## 1  1    1    4    8   11   10    5
## 2  2    2    5    7   10    9    4

The result is all variables (columns) for the two common observations (1 and 2). If we want to include observation 3, we can use:

merge(e3, e4, by = "id", all.x = TRUE)
##   id Var1 Var2 Var3 Var4 Var5 Var6
## 1  1    1    4    8   11   10    5
## 2  2    2    5    7   10    9    4
## 3  3    3    6    9   12   NA   NA

Note: The all.x argument refers to which observations from the first dataframe (e3) we want to preserve.

If we want to include observations 4 to 11, we can use:

merge(e3, e4, by = "id", all.y = TRUE)
##    id Var1 Var2 Var3 Var4 Var5 Var6
## 1   1    1    4    8   11   10    5
## 2   2    2    5    7   10    9    4
## 3   4   NA   NA   NA   NA    8    3
## 4   5   NA   NA   NA   NA    7    2
## 5   6   NA   NA   NA   NA    6    1
## 6   7   NA   NA   NA   NA    5    1
## 7   8   NA   NA   NA   NA    4    2
## 8   9   NA   NA   NA   NA    3    3
## 9  10   NA   NA   NA   NA    2    4
## 10 11   NA   NA   NA   NA    1    5

Note: The all.y argument refers to which observations from the second dataframe (e4) we want to preserve.

Of course, we can preserve both with either:

merge(e3, e4, by = "id", all.x = TRUE, all.y = TRUE)
##    id Var1 Var2 Var3 Var4 Var5 Var6
## 1   1    1    4    8   11   10    5
## 2   2    2    5    7   10    9    4
## 3   3    3    6    9   12   NA   NA
## 4   4   NA   NA   NA   NA    8    3
## 5   5   NA   NA   NA   NA    7    2
## 6   6   NA   NA   NA   NA    6    1
## 7   7   NA   NA   NA   NA    5    1
## 8   8   NA   NA   NA   NA    4    2
## 9   9   NA   NA   NA   NA    3    3
## 10 10   NA   NA   NA   NA    2    4
## 11 11   NA   NA   NA   NA    1    5
merge(e3, e4, by = "id", all = TRUE)
##    id Var1 Var2 Var3 Var4 Var5 Var6
## 1   1    1    4    8   11   10    5
## 2   2    2    5    7   10    9    4
## 3   3    3    6    9   12   NA   NA
## 4   4   NA   NA   NA   NA    8    3
## 5   5   NA   NA   NA   NA    7    2
## 6   6   NA   NA   NA   NA    6    1
## 7   7   NA   NA   NA   NA    5    1
## 8   8   NA   NA   NA   NA    4    2
## 9   9   NA   NA   NA   NA    3    3
## 10 10   NA   NA   NA   NA    2    4
## 11 11   NA   NA   NA   NA    1    5

These two R statements are equivalent.

Note: If we set by=NULL, we get a potentially unexpected result:

merge(e3, e4, by = NULL)
##    id.x Var1 Var2 Var3 Var4 Var5 Var6 id.y
## 1     1    1    4    8   11   10    5    1
## 2     2    2    5    7   10   10    5    1
## 3     3    3    6    9   12   10    5    1
## 4     1    1    4    8   11    9    4    2
## 5     2    2    5    7   10    9    4    2
## 6     3    3    6    9   12    9    4    2
## 7     1    1    4    8   11    8    3    4
## 8     2    2    5    7   10    8    3    4
## 9     3    3    6    9   12    8    3    4
## 10    1    1    4    8   11    7    2    5
## 11    2    2    5    7   10    7    2    5
## 12    3    3    6    9   12    7    2    5
## 13    1    1    4    8   11    6    1    6
## 14    2    2    5    7   10    6    1    6
## 15    3    3    6    9   12    6    1    6
## 16    1    1    4    8   11    5    1    7
## 17    2    2    5    7   10    5    1    7
## 18    3    3    6    9   12    5    1    7
## 19    1    1    4    8   11    4    2    8
## 20    2    2    5    7   10    4    2    8
## 21    3    3    6    9   12    4    2    8
## 22    1    1    4    8   11    3    3    9
## 23    2    2    5    7   10    3    3    9
## 24    3    3    6    9   12    3    3    9
## 25    1    1    4    8   11    2    4   10
## 26    2    2    5    7   10    2    4   10
## 27    3    3    6    9   12    2    4   10
## 28    1    1    4    8   11    1    5   11
## 29    2    2    5    7   10    1    5   11
## 30    3    3    6    9   12    1    5   11

Setting by = NULL merges every row of the first dataframe with every row of the second (a Cartesian product). If we instead leave by unspecified, the default is to merge on the variable names common to both dataframes. We can also separately specify by for each dataframe:

merge(e3, e4, by.x = "id", by.y = "id")
##   id Var1 Var2 Var3 Var4 Var5 Var6
## 1  1    1    4    8   11   10    5
## 2  2    2    5    7   10    9    4

This would be helpful if the identifier variable had a different name in each dataframe.
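
For example (a hypothetical sketch, where e5 is simply a copy of e4 with its identifier renamed), we could merge on differently named identifiers like so:

e5 <- e4
names(e5)[names(e5) == "id"] <- "subject"
merge(e3, e5, by.x = "id", by.y = "subject")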

Note: merge only works with two dataframes. So, if multiple dataframes need to be merged, it must be done sequentially:

merge(merge(e1, e2), e4)
##   id Var1 Var2 Var3 Var4 Var5 Var6
## 1  1    1    4    8   11   10    5
## 2  2    2    5    7   10    9    4

Exporting results to Word

This tutorial walks through some basics for how to export results to Word format files or similar.

install.packages(c("rtf"), repos = "http://cran.r-project.org")
## Warning: package 'rtf' is in use and will not be installed

As a running example, let's build a regression model, whose coefficients we want to output:

set.seed(1)
x1 <- runif(100, 0, 1)
x2 <- rbinom(100, 1, 0.5)
y <- x1 + x2 + rnorm(100)
s1 <- summary(lm(y ~ x1))
s2 <- summary(lm(y ~ x1 + x2))

Base R functions

One of the easiest ways to move results from R to Word is simply to copy and paste them. R results are printed in ASCII, though, so the results don't necessarily copy well (e.g., tables tend to lose their formatting unless they're pasted into the Word document using a fixed-width font like Courier). For example, we could just print coef(s2) to the console and manually copy and paste the results:

round(coef(s2), 2)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.07       0.24    0.28     0.78
## x1              0.92       0.36    2.52     0.01
## x2              0.89       0.19    4.57     0.00

Another, perhaps easier, alternative is to write the results to the console in comma-separated values (CSV) format:

write.csv(round(coef(s2), 2))
## "","Estimate","Std. Error","t value","Pr(>|t|)"
## "(Intercept)",0.07,0.24,0.28,0.78
## "x1",0.92,0.36,2.52,0.01
## "x2",0.89,0.19,4.57,0

The output doesn't look very pretty in R and it won't look very pretty in Word, either, at least not right away. If you copy the CSV format and paste it into a Word document, it will look like a mess. But, if you select the pasted text, click the “Insert” menu, and press “Table”, a menu will open, one option of which is “Convert text to table…” Clicking this option, selecting to “Separate text at” commas, and pressing OK will produce a nicely formatted table, resembling the original R output.

As long as you can convert an R object (or set of R objects) to a table-like structure, you can use write.csv and follow the instructions above to easily move that object into Word. Thus the biggest challenge for writing output to Word format is not the actual output, but the work of building a table-like structure that can easily be output. For example, let's build a nicer-looking results table that includes summary statistics and both of our regression models. First, we'll get all of our relevant statistics and then bind them together into a table:

r1 <- round(coef(s1), 2)
r2 <- round(coef(s2), 2)
# coefficients
c1 <- paste(r1[, 1], " (", r1[, 2], ")", sep = "")
c2 <- paste(r2[, 1], " (", r2[, 2], ")", sep = "")
# summary statistics
sigma <- round(c(s1$sigma, s2$sigma), 2)
rsq <- round(c(s1$adj.r.squared, s2$adj.r.squared), 2)
# sample sizes
n <- c(length(s1$residuals), length(s2$residuals))

Now let's bind this all together into a table and look at the resulting table:

outtab <- rbind(cbind(c(c1, ""), c2), sigma, rsq, n)
colnames(outtab) <- c("Model 1", "Model 2")
rownames(outtab) <- c("Intercept", "x1", "x2", "sigma", "Adj. R-Squared", "n")
outtab
##                Model 1       Model 2      
## Intercept      "0.61 (0.23)" "0.07 (0.24)"
## x1             "0.79 (0.4)"  "0.92 (0.36)"
## x2             ""            "0.89 (0.19)"
## sigma          "1.06"        "0.97"       
## Adj. R-Squared "0.03"        "0.19"       
## n              "100"         "100"

Then we can just write it to the console and follow the directions above to copy it to a nice table in Word:

write.csv(outtab)
## "","Model 1","Model 2"
## "Intercept","0.61 (0.23)","0.07 (0.24)"
## "x1","0.79 (0.4)","0.92 (0.36)"
## "x2","","0.89 (0.19)"
## "sigma","1.06","0.97"
## "Adj. R-Squared","0.03","0.19"
## "n","100","100"

The rtf Package

Another way to output results directly from R to Word is to use the rtf package. This package is designed to write Rich-Text Format (RTF) files, but can also be used to write Word files. It's actually very simple to use. You simply need to have the package create a Word (.doc) or RTF file to write to, then you can add plain text paragraphs or anything that can be structured as a dataframe directly to the file. You can then open the file directly with Word, finding the resulting text, tables, etc. neatly embedded. A basic example pasting our regression coefficient table and the nicer looking table is shown below:

library(rtf)
rtffile <- RTF("rtf.doc")  # this can be an .rtf or a .doc
addParagraph(rtffile, "This is the output of a regression coefficients:\n")
addTable(rtffile, as.data.frame(round(coef(s2), 2)))
addParagraph(rtffile, "\n\nThis is the nicer looking table we made above:\n")
addTable(rtffile, cbind(rownames(outtab), outtab))
done(rtffile)

You can then find the rtf.doc file in your working directory. Open it to take a look at the results. The rtf package also allows you to specify additional options about fonts and the like, making it possible to write a considerable amount of your results directly from R. See ? rtf for full details.

Factors

To extract the (unique) levels of a factor, use levels:

levels(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] "1" "2" "3"

Note: the levels of a factor are always character:

class(levels(factor(c(1, 2, 3, 2, 3, 2, 3))))
## [1] "character"

To obtain just the number of levels, use nlevels:

nlevels(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] 3

Converting from factor class

If the factor contains only integers, we can use unclass to convert it (back) to an integer class vector:

unclass(factor(c(1, 2, 3, 2, 3, 2, 3)))
## [1] 1 2 3 2 3 2 3
## attr(,"levels")
## [1] "1" "2" "3"

Note: The “levels” attribute is still being reported but the new object is not a factor.

But if the factor contains other numeric values, we can get unexpected results:

unclass(factor(c(1, 2, 1.5)))
## [1] 1 3 2
## attr(,"levels")
## [1] "1"   "1.5" "2"

We might have expected this to produce a numeric vector of the form c(1, 2, 1.5). Instead, we have obtained an integer-class vector of the form c(1, 3, 2). This is because the values record each element's position in the (sorted) set of factor levels, not the levels' actual values.

We can see this at work if we unclass a factor that was created from a character vector:

unclass(factor(c("a", "b", "a")))
## [1] 1 2 1
## attr(,"levels")
## [1] "a" "b"

The result is an integer vector: c(1,2,1)

This can be especially confusing if we create a factor from a combination of numeric and character elements:

unclass(factor(c("a", "b", 1, 2)))
## [1] 3 4 1 2
## attr(,"levels")
## [1] "1" "2" "a" "b"

The result is an integer vector, c(3,4,1,2), which we can see in several steps: (1) the numeric values are coerced to character

c("a", "b", 1, 2)
## [1] "a" "b" "1" "2"

(2) the levels of the factor are sorted numerically then alphabetically

factor(c("a", "b", 1, 2))
## [1] a b 1 2
## Levels: 1 2 a b

(3) the result is thus an integer vector, numbered according to the order of the factor levels

unclass(factor(c("a", "b", 1, 2)))
## [1] 3 4 1 2
## attr(,"levels")
## [1] "1" "2" "a" "b"

Modifying factors

Changing factors is similar to changing other types of data, but it has some unique challenges. We can see this if we compare a numeric vector to a factor version of the same data:

a <- 1:4
b <- factor(a)
a
## [1] 1 2 3 4
b
## [1] 1 2 3 4
## Levels: 1 2 3 4

We can see from the way the two variables are printed that the numeric vector and the factor look different. This is also true if we use indexing to see a subset of the vector:

a[1]
## [1] 1
b[1]
## [1] 1
## Levels: 1 2 3 4

If we try to change the value of an item in the numeric vector using positional indexing, there's no problem:

a[1] <- 5
a
## [1] 5 2 3 4

If we try to do the same thing with the factor, we get a warning:

b[1] <- 5
## Warning: invalid factor level, NA generated
b
## [1] <NA> 2    3    4   
## Levels: 1 2 3 4

And the result isn't what we wanted. We get a missing value. This is because 5 wasn't a valid level of our factor. Let's restore our b variable:

b <- factor(1:4)

Then we can add 5 to the levels by simply replacing the current levels with a vector of the current levels and 5:

levels(b) <- c(levels(b), 5)

Our variable hasn't changed, but its available levels have:

b
## [1] 1 2 3 4
## Levels: 1 2 3 4 5

Now we can change the value using positional indexing, just like before:

b[1] <- 5
b
## [1] 5 2 3 4
## Levels: 1 2 3 4 5

And we get the intended result.

This can be quite useful if we want to change the label for all values at a given level. To see this, we need a vector containing repeated values:

c <- factor(c(1:4, 1:3, 1:2, 1))
c
##  [1] 1 2 3 4 1 2 3 1 2 1
## Levels: 1 2 3 4

There are four levels to c:

levels(c)
## [1] "1" "2" "3" "4"

If we want to change c so that every 2 is now a 5, we can just change the appropriate level. This is easy here because 2 is the second level, but we'll see a different example below:

levels(c)[2]
## [1] "2"
levels(c)[2] <- 5
levels(c)[2]
## [1] "5"
c
##  [1] 1 5 3 4 1 5 3 1 5 1
## Levels: 1 5 3 4

Now c contains 5's in place of all of the 2's. But our replacement involved positional indexing, and the second factor level isn't always the number 2; it just depends on what data we have. So we can also replace factor levels using logicals (e.g., to change 5 to 9):

levels(c) == "5"
## [1] FALSE  TRUE FALSE FALSE
levels(c)[levels(c) == "5"]
## [1] "5"
levels(c)[levels(c) == "5"] <- 9
levels(c)
## [1] "1" "9" "3" "4"
c
##  [1] 1 9 3 4 1 9 3 1 9 1
## Levels: 1 9 3 4

As you can see, factors are a potentially useful way of storing different kinds of data, and R uses them a lot!

Heteroskedasticity-Consistent SEs for OLS

We often need to analyze data that fail to satisfy the assumptions of the statistical techniques we use. One common violation in OLS regression is of the assumption of homoskedasticity, which requires that the error term have constant variance across all values of the independent variable(s). When this assumption fails, the standard errors from our OLS regression are inconsistent, but we can calculate heteroskedasticity-consistent standard errors relatively easily. Unlike in Stata, where this is simply an option for regular OLS regression, in R these SEs are not built into the base package; instead they come in an add-on package called sandwich, which we need to install and load:

install.packages("sandwich", repos = "http://cran.r-project.org")
## Warning: package 'sandwich' is in use and will not be installed
library(sandwich)

To see the sandwich package in action, let's generate some heteroskedastic data:

set.seed(1)
x <- runif(500, 0, 1)
y <- 5 * rnorm(500, x, x)

A simple plot of y against x (and the associated regression line) will reveal any heteroskedasticity:

plot(y ~ x, col = "gray", pch = 19)
abline(lm(y ~ x), col = "blue")

plot of chunk unnamed-chunk-3

Clearly, the variance of y and thus of the error term in an OLS model of y~x will increase as x increases.

Now let's run the OLS model and see the results:

ols <- lm(y ~ x)
s <- summary(ols)
s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.273    0.95     0.34    
## x              4.241      0.479    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16

It may be particularly helpful to look just at the coefficient matrix from the summary object:

s$coef
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept)   0.2592     0.2732  0.9488 3.432e-01
## x             4.2414     0.4786  8.8615 1.402e-17

The second column shows the SEs. These SEs are themselves generated from the variance-covariance matrix for the coefficients, which we can see with:

vcov(ols)
##             (Intercept)       x
## (Intercept)     0.07463 -0.1135
## x              -0.11355  0.2291

The variance estimates for the coefficients are on the diagonal:

diag(vcov(ols))
## (Intercept)           x 
##     0.07463     0.22909

To convert these to SEs, we simply take the square root:

sqrt(diag(vcov(ols)))
## (Intercept)           x 
##      0.2732      0.4786

Now that we know where the regular SEs are coming from, let's get the heteroskedasticity-consistent SEs for this model from sandwich. The SEs come from the vcovHC function and the resulting object is the variance-covariance matrix for the coefficients:

vcovHC(ols)
##             (Intercept)        x
## (Intercept)     0.03335 -0.08751
## x              -0.08751  0.29242

This is, again, a variance-covariance matrix for the coefficients. So to get SEs, we take the square root of the diagonal, as we did above:

sqrt(diag(vcovHC(ols)))
## (Intercept)           x 
##      0.1826      0.5408

We can then compare the SE estimate from the standard formula to the heteroskedasticity-consistent formula:

sqrt(diag(vcov(ols)))
## (Intercept)           x 
##      0.2732      0.4786
sqrt(diag(vcovHC(ols)))
## (Intercept)           x 
##      0.1826      0.5408

One annoying thing about not having the heteroskedasticity-consistent formula built in is that when we call summary on ols, it prints the default SEs rather than the ones we really want. But remember: everything in R is an object. So we can overwrite the default SEs with the heteroskedasticity-consistent SEs quite easily. To do that, let's first look at the structure of our summary object s:

str(s)
## List of 11
##  $ call         : language lm(formula = y ~ x)
##  $ terms        :Classes 'terms', 'formula' length 3 y ~ x
##   .. ..- attr(*, "variables")= language list(y, x)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "y" "x"
##   .. .. .. ..$ : chr "x"
##   .. ..- attr(*, "term.labels")= chr "x"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0> 
##   .. ..- attr(*, "predvars")= language list(y, x)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "y" "x"
##  $ residuals    : Named num [1:500] 0.1231 0.7807 -0.0241 -0.6949 0.5952 ...
##   ..- attr(*, "names")= chr [1:500] "1" "2" "3" "4" ...
##  $ coefficients : num [1:2, 1:4] 0.259 4.241 0.273 0.479 0.949 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:2] FALSE FALSE
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "x"
##  $ sigma        : num 3.03
##  $ df           : int [1:3] 2 498 2
##  $ r.squared    : num 0.136
##  $ adj.r.squared: num 0.134
##  $ fstatistic   : Named num [1:3] 78.5 1 498
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:2, 1:2] 0.00813 -0.01237 -0.01237 0.02497
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##   .. ..$ : chr [1:2] "(Intercept)" "x"
##  - attr(*, "class")= chr "summary.lm"

s is a list, one element of which is coefficients (which we saw above when we first ran our OLS model). The s$coefficients object is a matrix, with four columns, the second of which contains the default standard errors. If we replace those standard errors with the heteroskedasticity-robust SEs, when we print s in the future, it will show the SEs we actually want. Let's see the effect by comparing the current output of s to the output after we replace the SEs:

s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.273    0.95     0.34    
## x              4.241      0.479    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16
s$coefficients[, 2] <- sqrt(diag(vcovHC(ols)))
s
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12.17  -1.27  -0.15   1.31   9.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.259      0.183    0.95     0.34    
## x              4.241      0.541    8.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.03 on 498 degrees of freedom
## Multiple R-squared:  0.136,  Adjusted R-squared:  0.134 
## F-statistic: 78.5 on 1 and 498 DF,  p-value: <2e-16

The summary output now reflects the SEs we want. But remember, if we call summary(ols) again, we'll see the original SEs; we need to call our modified s object to see the updated version.
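
An alternative, if you'd rather not modify the summary object at all, is the coeftest function from the add-on lmtest package, which builds a coefficient table from whatever variance-covariance matrix we supply. This is just a sketch, assuming lmtest is installed; unlike our manual replacement above, it also recalculates the t-statistics and p-values using the new SEs:

# install.packages("lmtest")  # if not already installed
library(lmtest)
# coefficient table computed with the heteroskedasticity-consistent vcov matrix
coeftest(ols, vcov. = vcovHC(ols))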

Interaction Plots

When we have continuous-by-continuous interactions in linear regression, it is impossible to directly interpret the coefficients on the interactions. In fact, it is just generally difficult to interpret these kinds of models. Often, a better approach is to translate one of the continuous variables into a factor and interpret the interaction-term coefficients for each level of that variable. Another approach is to visualize the interaction graphically. Both will give us the same inference.

Note: While interaction plots can help make effects interpretable, one of their major downsides is an inability to effectively convey statistical uncertainty. For this reason (and some of the other disadvantages that will become clear below), I would recommend these plots only for data summary, not for inference, prediction, or publication.

Let's start with some fake data:

set.seed(1)
x1 <- runif(100, 0, 1)
x2 <- sample(1:10, 100, TRUE)/10
y <- 1 + 2 * x1 + 3 * x2 + 4 * x1 * x2 + rnorm(100)

We've built a model that has a strong interaction between x1 and x2. We can model this as a continuous interaction:

m <- lm(y ~ x1 * x2)

Alternatively, we can treat x2 as a factor (because, while approximately continuous, it only takes on 10 discrete values):

m2 <- lm(y ~ x1 * factor(x2))

Let's look at the output of both models and see if we can make sense of them:

summary(m)
## 
## Call:
## lm(formula = y ~ x1 * x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.916 -0.603 -0.109  0.580  2.383 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.616      0.541    2.99   0.0036 ** 
## x1             0.783      0.878    0.89   0.3748    
## x2             1.937      0.865    2.24   0.0274 *  
## x1:x2          5.965      1.370    4.35  3.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.962 on 96 degrees of freedom
## Multiple R-squared:  0.802,  Adjusted R-squared:  0.795 
## F-statistic:  129 on 3 and 96 DF,  p-value: <2e-16
summary(m2)
## 
## Call:
## lm(formula = y ~ x1 * factor(x2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2086 -0.5368 -0.0675  0.5007  2.3648 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        3.5007     5.4427    0.64     0.52
## x1                -2.3644    10.4793   -0.23     0.82
## factor(x2)0.2     -1.5294     5.4711   -0.28     0.78
## factor(x2)0.3     -1.8828     5.5500   -0.34     0.74
## factor(x2)0.4     -1.2069     5.4991   -0.22     0.83
## factor(x2)0.5      0.2490     5.5039    0.05     0.96
## factor(x2)0.6     -0.8439     5.4614   -0.15     0.88
## factor(x2)0.7     -0.7045     5.4917   -0.13     0.90
## factor(x2)0.8     -0.3576     5.4764   -0.07     0.95
## factor(x2)0.9     -0.0365     5.5302   -0.01     0.99
## factor(x2)1       -0.3488     5.5207   -0.06     0.95
## x1:factor(x2)0.2   4.6778    10.5179    0.44     0.66
## x1:factor(x2)0.3   5.5067    10.5979    0.52     0.60
## x1:factor(x2)0.4   5.8601    10.5629    0.55     0.58
## x1:factor(x2)0.5   4.4851    10.6169    0.42     0.67
## x1:factor(x2)0.6   6.2043    10.5215    0.59     0.56
## x1:factor(x2)0.7   7.8928    10.5514    0.75     0.46
## x1:factor(x2)0.8   8.4370    10.5690    0.80     0.43
## x1:factor(x2)0.9   7.9411    10.5961    0.75     0.46
## x1:factor(x2)1     9.6787    10.5511    0.92     0.36
## 
## Residual standard error: 1.01 on 80 degrees of freedom
## Multiple R-squared:  0.819,  Adjusted R-squared:  0.776 
## F-statistic: 19.1 on 19 and 80 DF,  p-value: <2e-16

For our continuous-by-continuous interaction model, we have the interaction expressed as a single number: ~5.96. On its own this doesn't tell us anything useful, because its only direct interpretation is the additional expected value of y added to the intercept and the coefficients on each covariate, and only at the point where x1==1 and x2==1. Thus, while we might be inclined to talk about this as an interaction term, it really isn't…it's just a mostly meaningless number. In the second, continuous-by-factor model, things are more interpretable. Here our factor dummies for x2 tell us the expected value of y (if added to the intercept) when x1==0. Similarly, the factor-"interaction" dummies tell us the expected value of y (if added to the intercept and the coefficient on x1) when x1==1. These seem more interpretable.
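
One way to make the continuous-by-continuous coefficients more meaningful is to use them to compute the marginal effect (slope) of x1 at several values of x2: the slope of x1 is the coefficient on x1 plus the coefficient on x1:x2 times the chosen value of x2. A minimal sketch using the coefficients of m (the object names below are just illustrative):

cf <- coef(m)
x2vals <- seq(0, 1, by = 0.25)
# slope of y with respect to x1, evaluated at each chosen value of x2
me_x1 <- unname(cf["x1"] + cf["x1:x2"] * x2vals)
cbind(x2 = x2vals, slope_of_x1 = me_x1)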

Three-Dimensional Interaction Plotting

Another approach to understanding continuous-by-continuous interaction terms is to plot them. We saw above that the continuous-by-factor model, while interpretable, required a lot of numbers (in a large table) to communicate the relationships between x1, x2, and y. R offers a number of plotting functions to visualize these kinds of interaction "response surfaces".

Let's start by estimating predicted values. Because x1 and x2 are both scaled [0,1], we'll just create a single vector of values on the 0-1 scale and use it for both of our prediction variables.

nx <- seq(0, 1, length.out = 10)

The use of the outer function here is, again, a convenience because our input values are scaled [0,1]. Essentially, it builds a 10-by-10 matrix of input values and predicts y for each combination of x1 and x2.

z <- outer(nx, nx, FUN = function(a, b) predict(m, data.frame(x1 = a, x2 = b)))

We can look at the z matrix to see what is going on:

z
##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10]
##  [1,] 1.616 1.831 2.046 2.262 2.477 2.692 2.907 3.122 3.338  3.553
##  [2,] 1.703 1.992 2.281 2.570 2.858 3.147 3.436 3.725 4.014  4.303
##  [3,] 1.790 2.152 2.515 2.877 3.240 3.602 3.965 4.327 4.690  5.052
##  [4,] 1.877 2.313 2.749 3.185 3.621 4.058 4.494 4.930 5.366  5.802
##  [5,] 1.964 2.474 2.983 3.493 4.003 4.513 5.022 5.532 6.042  6.552
##  [6,] 2.051 2.634 3.218 3.801 4.384 4.968 5.551 6.135 6.718  7.301
##  [7,] 2.138 2.795 3.452 4.109 4.766 5.423 6.080 6.737 7.394  8.051
##  [8,] 2.225 2.955 3.686 4.417 5.147 5.878 6.609 7.339 8.070  8.801
##  [9,] 2.312 3.116 3.920 4.725 5.529 6.333 7.138 7.942 8.746  9.550
## [10,] 2.399 3.277 4.155 5.033 5.910 6.788 7.666 8.544 9.422 10.300

All of the plotting functions that follow use this z matrix as the "height" of the plot at each combination of x1 and x2. It sounds a little crazy, but it will become clear once we do the plotting.

Perspective plots

A perspective plot draws a “response surface” (i.e., the values of the z matrix) across a two-dimensional grid. The plot is what you might typically think of when you hear “three-dimensional graph”. Let's take a look:

persp(nx, nx, z, theta = 45, phi = 10, shade = 0.75, xlab = "x1", ylab = "x2", 
    zlab = "y")

[Plot: perspective plot of the predicted-y response surface across x1 and x2]

Note: The theta parameter refers to the horizontal rotation of the plot and the phi parameter refers to the tilt of the plot (see ?persp). The plot shows us many things, especially:

1. The vertical height of the surface is the expected (predicted) value of y at each combination of x1 and x2.

2. The slope of the surface on each edge of the plot is a marginal effect. In other words, the shallow slope on the lefthand face of the plot is the marginal effect of x1 when x2==0. Similarly, the steep slope on the righthand face of the plot is the marginal effect of x2 when x1==1. The other marginal effects (x1|x2==1 and x2|x1==0) are hidden from our view on the back of the plot.

There are two problems with perspective plots:

1. Because they are two-dimensional representations of three-dimensional objects, their scales are deceiving. Clearly the "height" of the plot is bigger in the front than in the back. It is therefore only a heuristic.

2. Because they are three-dimensional, we cannot see the entire plot at once (as evidenced by the two hidden marginal effects discussed above).

There is nothing we can do about the first point, unless you want to use a 3D printer to print out the response surface. On the second point, however, we can look at different rotations of the plot in order to get a better grasp on the various marginal effects.

Let's look at two different sets of rotations. One showing four plots on the diagonal (like above):

par(mai = rep(0.2, 4))
layout(matrix(1:4, nrow = 2, byrow = TRUE))
s <- sapply(c(45, 135, 225, 315), function(i) persp(nx, nx, z, theta = i, phi = 10, 
    shade = 0.75, xlab = "x1", ylab = "x2", zlab = "y"))

[Plot: four perspective plots of the response surface, rotated in 90-degree increments (theta = 45, 135, 225, 315)]

The plot in the upper-left corner is the same one we saw above. But now, we see three additional rotations (imagine the plots rotating 90 degrees each, right-to-left), so the lower-right plot highlights the two “hidden” marginal effects from above.

Another set of plots shows the same plot at right angles, thus highlighting the marginal effects at approximately true scale but masking much of the curvature of the response surface:

par(mai = rep(0.2, 4))
layout(matrix(1:4, nrow = 2))
sapply(c(90, 180, 270, 360), function(i) persp(nx, nx, z, theta = i, phi = 10, 
    shade = 0.75, xlab = "x1", ylab = "x2", zlab = "y"))

[Plot: four perspective plots of the response surface viewed at right angles (theta = 90, 180, 270, 360)]

##             [,1]       [,2]       [,3]       [,4]
##  [1,]  1.225e-16 -2.000e+00 -3.674e-16  2.000e+00
##  [2,]  2.000e+00  2.449e-16 -2.000e+00 -4.898e-16
##  [3,] -1.410e-17 -1.727e-33  1.410e-17  3.454e-33
##  [4,] -1.000e+00  1.000e+00  1.000e+00 -1.000e+00
##  [5,] -3.473e-01 -4.253e-17  3.473e-01  8.506e-17
##  [6,]  1.419e-16 -3.473e-01  5.680e-17  3.473e-01
##  [7,]  2.268e-01  2.268e-01  2.268e-01  2.268e-01
##  [8,] -1.178e+00 -1.178e+00 -1.525e+00 -1.525e+00
##  [9,]  1.970e+00  2.412e-16 -1.970e+00 -4.824e-16
## [10,] -9.934e-17  1.970e+00  3.831e-16 -1.970e+00
## [11,]  3.999e-02  3.999e-02  3.999e-02  3.999e-02
## [12,] -3.955e+00 -3.955e+00 -1.986e+00 -1.986e+00
## [13,] -1.970e+00 -2.412e-16  1.970e+00  4.824e-16
## [14,]  9.934e-17 -1.970e+00 -3.831e-16  1.970e+00
## [15,] -3.999e-02 -3.999e-02 -3.999e-02 -3.999e-02
## [16,]  4.955e+00  4.955e+00  2.986e+00  2.986e+00

While this highlights the marginal effects somewhat nicely, the two left-hand plots are quite difficult to actually look at due to the shape of the interaction. (The matrix printed above is just the set of perspective transformation matrices returned by each persp call; because we didn't assign the result of sapply to anything, it was printed to the console.) Note: The plots can be colored in many interesting ways, but the details are complicated (see ?persp).

Matrix Plots

Because the perspective plots are somewhat difficult to interpret, we might want to produce a two-dimensional representation that better highlights our interaction without the confusion of flattening a three-dimensional surface to two dimensions. The image function supplies us with a way to use color (or grayscale, in the case below) to show the values of y across the x1-by-x2 matrix. We again supply arguments quite similar to above:

layout(1)
par(mai = rep(1, 4))
image(nx, nx, z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(50:1/50))

[Plot: grayscale image plot of expected y across x1 and x2]

Here, the darker colors represent higher values of y. Because the mind can't interpret color differences as well as it can interpret differences in slope, the interaction becomes somewhat muddled. For example, the marginal effect of x2|x1==0 is much less steep than the marginal effect of x2|x1==1, but it is difficult to quantify that by comparing the difference between white and gray on the left-hand side of the plot to the difference between gray and black on the right-hand side of the plot (those differences in color represent the marginal effects). We could redraw the plot with some contour lines to try to see things better:

image(nx, nx, z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(50:1/50))
contour(z = z, add = TRUE)

[Plot: grayscale image plot of expected y with contour lines overlaid]

Here we see that when x1==0, a change in x2 from 0 to 1 increases y only from about 1.6 to about 3.6. By contrast, when x1==1, the same change in x2 increases y from about 2.4 to about 10.3, which is substantially larger.

Contour lines

Since the contours seemed to make all of the difference in terms of interpretability above, we could just draw those instead without the underlying image matrix:

filled.contour(z = z, xlab = "x1", ylab = "x2", main = "Expected Y", col = gray(20:1/20))

[Plot: filled contour plot of expected y with a gradient legend at right]

Here we see the same relationship highlighted by the contour lines, but they are nicely scaled and the plot supplies a gradient scale (at right) to help quantify the different colors.

Thus we have several different ways to look at continuous-by-continuous interactions. All of these techniques have advantages and disadvantages, but all do a better job at clarifying the nature of the relationships between x1, x2, and y than does the standard regression model or even the continuous-by-factor model.

Lists

Lists are a very helpful data structure, especially for large projects. Lists allow us to store other types of R objects together inside another object. For example, instead of having two vectors a and b, we could put those vectors in a list:

a <- 1:10
b <- 11:20
x <- list(a, b)
x
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1] 11 12 13 14 15 16 17 18 19 20

The result is a list with two elements, where the elements are the original vectors. We can also build lists without defining the list elements beforehand:

x <- list(1:5, 6:10)
x
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1]  6  7  8  9 10

Positional indexing of lists

Positional indexing of lists is similar to positional indexing of vectors, with a few important differences. If we index our list x with [], the result is a list:

x[1]
## [[1]]
## [1] 1 2 3 4 5
x[2]
## [[1]]
## [1]  6  7  8  9 10
x[1:2]
## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1]  6  7  8  9 10

If we try to index with 0, we get an empty list:

x[0]
## list()

And if we try to index with a value larger than length(x), we get a list with a NULL element:

length(x)
## [1] 2
x[length(x) + 1]
## [[1]]
## NULL

Lists also allow us to use a different kind of positional indexing involving two brackets (e.g., [[]]):

x[[1]]
## [1] 1 2 3 4 5

Rather than returning a list, this returns the vector that is stored in list element 1. Note that indexing like x[[1:2]] won't give us the first and second vectors combined; with double brackets, a vector of indices is treated as a recursive index (so x[[1:2]] returns the second element of the first list element).
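
If we do want the contents of the first two list elements combined into a single vector, one simple approach is to take a single-bracket subset and flatten it with unlist:

# x[1:2] is a list holding both vectors; unlist collapses it into one vector
unlist(x[1:2])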

The double bracket indexing also lets us index elements of the vector stored in a list element. For example, if we want to get the third element of the second list item, we can use two sets of indices:

x[[2]][3]
## [1] 8

Named indexing of lists

Just like vectors, list elements can have names.

y <- list(first = 4:6, second = 7:9, third = 1:3)
y
## $first
## [1] 4 5 6
## 
## $second
## [1] 7 8 9
## 
## $third
## [1] 1 2 3

The result is a list with three named elements, each of which is a vector. We can still index this list positionally:

y[1]
## $first
## [1] 4 5 6
y[[3]]
## [1] 1 2 3

But we can also index by the names, like we did with vectors. This can involve single bracket indexing, to return a single-element list:

y["first"]
## $first
## [1] 4 5 6

Or a subset of the original list elements:

y[c("first", "third")]
## $first
## [1] 4 5 6
## 
## $third
## [1] 1 2 3

It can also involve double bracket indexing, to return a vector:

y[["second"]]
## [1] 7 8 9

We can then combine this named indexing of the list with the numeric indexing of one of the list's vectors:

y[["second"]][3]
## [1] 9

Named indexing also allows us to use a new operator, the dollar sign ($). The $ sign is equivalent to named indexing:

y[["first"]]
## [1] 4 5 6
y$first
## [1] 4 5 6

And, just as with named indexing in double brackets, we can combine $ indexing with vector positional indexing:

y[["first"]][2]
## [1] 5
y$first[2]
## [1] 5

Modifying list elements

We can easily modify the elements of a list using positional or named indexing.

w <- list(a = 1:5, b = 6:10)
w
## $a
## [1] 1 2 3 4 5
## 
## $b
## [1]  6  7  8  9 10
w[[1]] <- 5:1
w
## $a
## [1] 5 4 3 2 1
## 
## $b
## [1]  6  7  8  9 10
w[["a"]] <- rep(1, 5)
w
## $a
## [1] 1 1 1 1 1
## 
## $b
## [1]  6  7  8  9 10

We can also add new elements to a list using positions or names:

w[[length(w) + 1]] <- 1
w$d <- 2
w
## $a
## [1] 1 1 1 1 1
## 
## $b
## [1]  6  7  8  9 10
## 
## [[3]]
## [1] 1
## 
## $d
## [1] 2

The result is a list with some named and some unnamed elements:

names(w)
## [1] "a" "b" ""  "d"

We can fill in the empty ('') name:

names(w)[3] <- "c"
names(w)
## [1] "a" "b" "c" "d"
w
## $a
## [1] 1 1 1 1 1
## 
## $b
## [1]  6  7  8  9 10
## 
## $c
## [1] 1
## 
## $d
## [1] 2

Or we could change all the names entirely:

names(w) <- c("do", "re", "mi", "fa")
names(w)
## [1] "do" "re" "mi" "fa"
w
## $do
## [1] 1 1 1 1 1
## 
## $re
## [1]  6  7  8  9 10
## 
## $mi
## [1] 1
## 
## $fa
## [1] 2

Lists are flexible and therefore important! The above exercises also showed that lists can contain different kinds of elements. Not every element in a list has to be the same length or the same class. Indeed, we can create a list that mixes many kinds of elements:

m <- list(a = 1, b = 1:5, c = "hello", d = factor(1:3))
m
## $a
## [1] 1
## 
## $b
## [1] 1 2 3 4 5
## 
## $c
## [1] "hello"
## 
## $d
## [1] 1 2 3
## Levels: 1 2 3

This is important because many of the functions we will use to do analysis in R return lists with different kinds of information. To really use R effectively, we need to be able to extract information from those resulting lists.
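
As a quick illustration (a sketch with made-up data; the object names are just illustrative), a fitted regression model and its summary are both lists, so the indexing tools above are exactly what we use to pull results out of them:

xx <- rnorm(20)
yy <- 2 * xx + rnorm(20)
fit <- lm(yy ~ xx)
is.list(fit)             # a fitted 'lm' object is stored as a list
names(fit)               # see which elements it contains
fit$coefficients         # extract one element by name
summary(fit)$r.squared   # summary() returns another list we can index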

Converting a list to a vector (and back)

It may at some point be helpful to have our list in the form of a vector. For example, we may want to be able to see all of the elements of every vector in the list as a single vector. To get this, we unlist the list, which converts it into a vector and automatically names the vector elements according to the names of the original list:

z1 <- unlist(y)
z1
##  first1  first2  first3 second1 second2 second3  third1  third2  third3 
##       4       5       6       7       8       9       1       2       3

We could also turn this back into a list, with every element of unlist(y) being a separate element of a new list:

z2 <- as.list(z1)
z2
## $first1
## [1] 4
## 
## $first2
## [1] 5
## 
## $first3
## [1] 6
## 
## $second1
## [1] 7
## 
## $second2
## [1] 8
## 
## $second3
## [1] 9
## 
## $third1
## [1] 1
## 
## $third2
## [1] 2
## 
## $third3
## [1] 3

Here all of the elements of the vector are separate list elements and vector names are transferred to the new list. We can see that the names of the vector are the same as the names of the list:

names(z1)
## [1] "first1"  "first2"  "first3"  "second1" "second2" "second3" "third1" 
## [8] "third2"  "third3"
names(z2)
## [1] "first1"  "first2"  "first3"  "second1" "second2" "second3" "third1" 
## [8] "third2"  "third3"

Loading Data

In order to use R for data analysis, we need to get our data into R. Unfortunately, because R lacks a graphical user interface, loading data is not particularly intuitive for those used to working with other statistical software. This tutorial explains how to load data into R as a dataframe object.

General Notes

As a preliminary note, one of the things about R that causes a fair amount of confusion is that R reads character data, by default, as factor. In other words, when your data contain alphanumeric character strings (e.g., names of countries, free response survey questions), R will read those data in as factor variables rather than character variables. This can be changed when reading in data using almost any of the following techniques by setting a stringsAsFactors=FALSE argument.
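
For example, a minimal sketch with read.csv (described below); the filename here is hypothetical:

# keep character columns as character rather than factor
# ('mydata.csv' is a hypothetical filename)
mydf <- read.csv("mydata.csv", stringsAsFactors = FALSE)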

A second point of difficulty for beginners to R is that R offers no obvious visual way to load data into R. Lacking a full graphical user interface, there is no “open” button to read in a dataset. The closest thing to this is the file.choose function. If you don't know the name or location of a file you want to load, you can use file.choose() to open a dialog window that will let you select a file. The response, however, is just a character string containing the name and full path of the file. No action is taken with regard to that file. If, for example, you want to load a comma-separated value file (described below), you could make a call like the following:

# read.csv(file.choose())

This will first open the file choose dialog window and, when you select a file, R will then process that file with read.csv and return a dataframe. While file.choose is convenient for working with R interactively, it is generally better to write filenames into your code manually to maximize reproducibility.

Built-in Data

One of the neat little features of R is that it comes with some built-in datasets, and many add-on packages supply additional datasets to demonstrate their functionality. We can access these datasets with the data() function. Here we'll just print the first few datasets:

head(data()$results)
##      Package LibPath                              Item       
## [1,] "car"   "C:/Program Files/R/R-3.0.2/library" "AMSsurvey"
## [2,] "car"   "C:/Program Files/R/R-3.0.2/library" "Adler"    
## [3,] "car"   "C:/Program Files/R/R-3.0.2/library" "Angell"   
## [4,] "car"   "C:/Program Files/R/R-3.0.2/library" "Anscombe" 
## [5,] "car"   "C:/Program Files/R/R-3.0.2/library" "Baumann"  
## [6,] "car"   "C:/Program Files/R/R-3.0.2/library" "Bfox"     
##      Title                                        
## [1,] "American Math Society Survey Data"          
## [2,] "Experimenter Expectations"                  
## [3,] "Moral Integration of American Cities"       
## [4,] "U. S. State Public-School Expenditures"     
## [5,] "Methods of Teaching Reading Comprehension"  
## [6,] "Canadian Women's Labour-Force Participation"

Datasets in the datasets package are pre-loaded with R and can simply be called by name from the R console. For example, we can see the "Monthly Airline Passenger Numbers 1949-1960" dataset by simply calling:

AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432

To obtain detailed information about a dataset, you can just access its documentation: ?AirPassengers. We generally want to work with our own data, however, rather than some arbitrary dataset, so we'll have to load data into R.
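
Datasets supplied by add-on packages usually need to be loaded explicitly with data() before use. A minimal sketch, assuming the car package (which appeared in the listing above) is installed:

# load a dataset that ships with an add-on package into the workspace
data("Anscombe", package = "car")
head(Anscombe)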

Manual data entry

Because a dataframe is just a collection of data vectors, we can always enter data by hand into the R console. For example, let's say we have two variables (height and weight) measured on each of six observations. We can enter these simply by typing them into the console and combining them into a dataframe, like:

height <- c(165, 170, 163, 182, 175, 190)
weight <- c(45, 60, 70, 80, 63, 72)
mydf <- cbind.data.frame(height, weight)

We can then call our dataframe by name:

mydf
##   height weight
## 1    165     45
## 2    170     60
## 3    163     70
## 4    182     80
## 5    175     63
## 6    190     72

R also provides a function called scan that allows us to type data into a special prompt. For example, we might want to read in six values of gender for our observations above; we could do that by typing mydf$gender <- scan(n = 6, what = numeric()) and entering the six values, one per line, when prompted (the what argument expects an example value of the desired type). But entering data manually in this fashion is inefficient and doesn't make sense if we already have data saved in an external file.

Loading tabular data

The easiest data to load into R comes in tabular file formats, like comma-separated value (CSV) or tab-separated value (TSV) files. These can easily be created using a spreadsheet editor (like Microsoft Excel), a text editor (like Notepad), or exported from many other computer programs (including all statistical packages).

read.table and its variants

The general function for reading these kinds of data is called read.table. Two other functions, read.csv and read.delim, provide convenient wrappers for reading CSV and TSV files, respectively. (Note: read.csv2 and read.delim2 provide slightly different wrappers designed for reading data that uses a semicolon rather than comma separator and a comma rather than a period as the decimal point.) Reading in data that is in CSV format is easy. For example, let's read in the following file, which contains some data about patient admissions for five patients:

patient,dob,entry,discharge,fee,sex
001,10/21/1946,12/12/2004,12/14/2004,8000,1
002,05/01/1980,07/08/2004,08/08/2004,12000,2
003,01/01/1960,01/01/2004,01/04/2004,9000,2
004,06/23/1998,11/11/2004,12/25/2004,15123,1

We can read these data in from the console by copying and pasting them into a command like the following:

mydf <- read.csv(text = "\npatient,dob,entry,discharge,fee,sex\n001,10/21/1946,12/12/2004,12/14/2004,8000,1\n002,05/01/1980,07/08/2004,08/08/2004,12000,2\n003,01/01/1960,01/01/2004,01/04/2004,9000,2\n004,06/23/1998,11/11/2004,12/25/2004,15123,1")
mydf
##   patient        dob      entry  discharge   fee sex
## 1       1 10/21/1946 12/12/2004 12/14/2004  8000   1
## 2       2 05/01/1980 07/08/2004 08/08/2004 12000   2
## 3       3 01/01/1960 01/01/2004 01/04/2004  9000   2
## 4       4 06/23/1998 11/11/2004 12/25/2004 15123   1

Or, we can read them from the local file directly:

mydf <- read.csv("../Data/patient.csv")

Reading them in either way will produce the exact same dataframe. If the data were tab- or semicolon-separated, the call would be exactly the same except for the use of read.delim and read.csv2, respectively.

Note: Any time we read data into R, we need to store it as a variable, otherwise it will simply be printed to the console and we won't be able to do anything with it. You can name dataframes whatever you want.

scan and readLines

Occasionally, we need to read in data as a vector of character strings rather than as delimited data to make a dataframe. For example, we might have a file that contains textual data (e.g., from a news story) and we want to read in each word or each line of the file as a separate element of a vector in order to perform some kind of text processing on it. To do this kind of analysis we can use one of two functions. The scan function we used above to manually enter data at the console can also be used to read data in from a file, as can another function called readLines. We can see how the two functions work by first writing some miscellaneous text to a file (using cat) and then reading in that content:

cat("TITLE", "A first line of text", "A second line of text", "The last line of text", 
    file = "ex.data", sep = "\n")

We can use scan to read in the data as a vector of words:

scan("ex.data", what = "character")
##  [1] "TITLE"  "A"      "first"  "line"   "of"     "text"   "A"     
##  [8] "second" "line"   "of"     "text"   "The"    "last"   "line"  
## [15] "of"     "text"

The scan function accepts additional arguments such as n to specify the number of lines to read from the file and sep to specify how to divide the file into separate entries in the resulting vector:

scan("ex.data", what = "character", sep = "\n")
## [1] "TITLE"                 "A first line of text"  "A second line of text"
## [4] "The last line of text"
scan("ex.data", what = "character", n = 1, sep = "\n")
## [1] "TITLE"

We can do the same thing with readLines, which assumes that we want to read each line as a complete string rather than separating the file contents in some way:

readLines("ex.data")
## [1] "TITLE"                 "A first line of text"  "A second line of text"
## [4] "The last line of text"

It also accepts an n argument:

readLines("ex.data", n = 2)
## [1] "TITLE"                "A first line of text"

Let's delete the file we created just to cleanup:

unlink("ex.data")  # tidy up

Reading .RData data

R has its own file format called .RData that can be used to store data for use in R. It is fairly rare to encounter data in this format, but reading it into R is - as one might expect - very easy. You simply need to call load('thefile.RData') and the objects stored in the file will be loaded into memory in R. One context in which you might use an .RData file is when saving your R workspace. When you quit R (using q()), R asks if you want to save your workspace. If you select "yes", R stores all of the objects currently in memory in a .RData file. This file can then be loaded in a subsequent R session to pick up quite literally where you left off when you saved the file.
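
A minimal sketch of saving and loading objects yourself (the filename is arbitrary and hypothetical):

# save one or more objects to an .RData file...
save(mydf, file = "mydf.RData")
# ...and load them back into the workspace in a later session
load("mydf.RData")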

Loading “Foreign” data

Because many people use statistical packages like SAS, SPSS, and Stata for statistical analysis, much of the data available in the world is saved in proprietary file formats created and owned by the companies that publish that software. This is bad because those data formats are deprecated (i.e., made obsolete) quite often (e.g., when Stata upgraded to version 11, it introduced a new file format, and its older file formats were no longer compatible with the newest version of the software). This creates problems for reproducibility because not everyone has access to Stata (or SPSS or SAS), and storing data in these formats makes it harder to share data and ties data to specific software owned by specific companies. Editorializing aside, R can import data from a variety of proprietary file formats. Doing so requires one of the recommended add-on packages called foreign. Let's load it here:

library(foreign)

The foreign package can be used to import data from a variety of proprietary formats, including Stata .dta formats (using the read.dta function), Octave or Matlab .mat formats (using read.octave), SPSS .sav formats (using read.spss), SAS permanent .sas7bdat formats (using read.ssd), SAS XPORT .stx or .xpt formats (using read.xport), Systat .syd formats (using read.systat), and Minitab .tmp formats (using read.mtp). Note: The foreign package sometimes has trouble with SPSS formats, but these files can also be opened with the spss.get function from the Hmisc package or one of several functions from the memisc package (spss.fixed.file, spss.portable.file, and spss.system.file). We can try loading some "foreign" data stored in Stata format:

englebert <- read.dta("../Data/EnglebertPRQ2000.dta")
## Warning: cannot read factor labels from Stata 5 files

We can then look at the loaded data using any of our usual object examination functions:

dim(englebert)  # dimensions
## [1] 50 27
head(englebert)  # first few rows
##        country wbcode indep paris london brussels lisbon commit exprop
## 1       ANGOLA    AGO  1975     0      0        0      1  3.820   5.36
## 2        BENIN    BEN  1960     1      0        0      0  4.667   6.00
## 3     BOTSWANA    BWA  1966     0      1        0      0  6.770   7.73
## 4 BURKINA FASO    BFA  1960     1      0        0      0  5.000   4.45
## 5      BURUNDI    BDI  1962     0      0        1      0  6.667   7.00
## 6     CAMEROON    CMR  1960     1      0        0      0  6.140   6.45
##   corrupt instqual buroqual goodgov ruleolaw pubadmin     growth  lcon
## 1   5.000   2.7300    4.470   4.280    3.970     4.73 -0.0306405 6.594
## 2   1.333   3.0000    2.667   3.533    4.556     2.00 -0.0030205 6.949
## 3   6.590   8.3300    6.140   7.110    7.610     6.36  0.0559447 6.358
## 4   6.060   5.3000    4.170   5.000    4.920     5.11 -0.0000589 6.122
## 5   3.000   0.8333    4.000   4.300    4.833     3.50 -0.0036746 6.461
## 6   4.240   4.5500    6.670   5.610    5.710     5.45  0.0147910 6.463
##   lconsq      i     g vlegit hlegit  elf hieafvm hieafvs warciv language
## 1  43.49  3.273 34.22      0 0.5250 0.78    1.00    0.00     24      4.2
## 2  48.29  6.524 22.79      0 0.6746 0.62    2.67    0.47      0      5.3
## 3  40.42 22.217 27.00      1 0.9035 0.51    2.00    0.00      0      3.1
## 4  37.48  7.858 17.86      0 0.5735 0.68    1.25    0.97      0      4.8
## 5  41.75  4.939 13.71      1 0.9800 0.04    3.00    0.00      8      0.6
## 6  41.77  8.315 20.67      0 0.8565 0.89    1.50    0.76      0      8.3
names(englebert)  # column/variable names
##  [1] "country"  "wbcode"   "indep"    "paris"    "london"   "brussels"
##  [7] "lisbon"   "commit"   "exprop"   "corrupt"  "instqual" "buroqual"
## [13] "goodgov"  "ruleolaw" "pubadmin" "growth"   "lcon"     "lconsq"  
## [19] "i"        "g"        "vlegit"   "hlegit"   "elf"      "hieafvm" 
## [25] "hieafvs"  "warciv"   "language"
str(englebert)  # object structure
## 'data.frame':    50 obs. of  27 variables:
##  $ country : chr  "ANGOLA" "BENIN" "BOTSWANA" "BURKINA FASO" ...
##  $ wbcode  : chr  "AGO" "BEN" "BWA" "BFA" ...
##  $ indep   : num  1975 1960 1966 1960 1962 ...
##  $ paris   : num  0 1 0 1 0 1 0 1 1 1 ...
##  $ london  : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ brussels: num  0 0 0 0 1 0 0 0 0 0 ...
##  $ lisbon  : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ commit  : num  3.82 4.67 6.77 5 6.67 ...
##  $ exprop  : num  5.36 6 7.73 4.45 7 ...
##  $ corrupt : num  5 1.33 6.59 6.06 3 ...
##  $ instqual: num  2.73 3 8.33 5.3 0.833 ...
##  $ buroqual: num  4.47 2.67 6.14 4.17 4 ...
##  $ goodgov : num  4.28 3.53 7.11 5 4.3 ...
##  $ ruleolaw: num  3.97 4.56 7.61 4.92 4.83 ...
##  $ pubadmin: num  4.73 2 6.36 5.11 3.5 ...
##  $ growth  : num  -3.06e-02 -3.02e-03 5.59e-02 -5.89e-05 -3.67e-03 ...
##  $ lcon    : num  6.59 6.95 6.36 6.12 6.46 ...
##  $ lconsq  : num  43.5 48.3 40.4 37.5 41.8 ...
##  $ i       : num  3.27 6.52 22.22 7.86 4.94 ...
##  $ g       : num  34.2 22.8 27 17.9 13.7 ...
##  $ vlegit  : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ hlegit  : num  0.525 0.675 0.904 0.573 0.98 ...
##  $ elf     : num  0.78 0.62 0.51 0.68 0.04 ...
##  $ hieafvm : num  1 2.67 2 1.25 3 ...
##  $ hieafvs : num  0 0.47 0 0.97 0 ...
##  $ warciv  : num  24 0 0 0 8 0 0 0 29 0 ...
##  $ language: num  4.2 5.3 3.1 4.8 0.6 ...
##  - attr(*, "datalabel")= chr ""
##  - attr(*, "time.stamp")= chr "25 Mar 2000 18:07"
##  - attr(*, "formats")= chr  "%21s" "%9s" "%9.0g" "%9.0g" ...
##  - attr(*, "types")= int  148 133 102 102 102 102 102 102 102 102 ...
##  - attr(*, "val.labels")= chr  "" "" "" "" ...
##  - attr(*, "var.labels")= chr  "Name of country" "World Bank three-letter code" "Date of independence" "Colonization by France" ...
##  - attr(*, "version")= int 5
summary(englebert)  # summary
##    country             wbcode              indep          paris     
##  Length:50          Length:50          Min.   :  -4   Min.   :0.00  
##  Class :character   Class :character   1st Qu.:1960   1st Qu.:0.00  
##  Mode  :character   Mode  :character   Median :1962   Median :0.00  
##                                        Mean   :1921   Mean   :0.38  
##                                        3rd Qu.:1968   3rd Qu.:1.00  
##                                        Max.   :1993   Max.   :1.00  
##                                        NA's   :2                    
##      london        brussels        lisbon        commit         exprop    
##  Min.   :0.00   Min.   :0.00   Min.   :0.0   Min.   :1.68   Min.   :2.00  
##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.0   1st Qu.:4.00   1st Qu.:4.50  
##  Median :0.00   Median :0.00   Median :0.0   Median :5.00   Median :6.05  
##  Mean   :0.34   Mean   :0.06   Mean   :0.1   Mean   :4.94   Mean   :5.90  
##  3rd Qu.:1.00   3rd Qu.:0.00   3rd Qu.:0.0   3rd Qu.:6.04   3rd Qu.:6.88  
##  Max.   :1.00   Max.   :1.00   Max.   :1.0   Max.   :8.00   Max.   :9.33  
##                                              NA's   :7      NA's   :7     
##     corrupt        instqual        buroqual         goodgov    
##  Min.   :0.00   Min.   :0.833   Min.   : 0.667   Min.   :1.95  
##  1st Qu.:3.00   1st Qu.:3.180   1st Qu.: 3.130   1st Qu.:3.99  
##  Median :4.39   Median :3.790   Median : 3.940   Median :4.87  
##  Mean   :4.38   Mean   :4.154   Mean   : 4.239   Mean   :4.72  
##  3rd Qu.:5.79   3rd Qu.:5.340   3rd Qu.: 5.300   3rd Qu.:5.53  
##  Max.   :8.71   Max.   :8.330   Max.   :10.000   Max.   :7.40  
##  NA's   :7      NA's   :7       NA's   :7        NA's   :7     
##     ruleolaw       pubadmin        growth            lcon     
##  Min.   :2.33   Min.   :1.25   Min.   :-0.038   Min.   :5.53  
##  1st Qu.:4.33   1st Qu.:3.25   1st Qu.:-0.005   1st Qu.:6.32  
##  Median :5.02   Median :4.17   Median : 0.002   Median :6.60  
##  Mean   :5.00   Mean   :4.31   Mean   : 0.004   Mean   :6.67  
##  3rd Qu.:5.97   3rd Qu.:5.49   3rd Qu.: 0.013   3rd Qu.:7.01  
##  Max.   :7.61   Max.   :9.36   Max.   : 0.056   Max.   :8.04  
##  NA's   :7      NA's   :7      NA's   :6        NA's   :6     
##      lconsq           i               g            vlegit     
##  Min.   :30.6   Min.   : 1.40   Min.   :11.1   Min.   :0.000  
##  1st Qu.:39.9   1st Qu.: 5.41   1st Qu.:18.7   1st Qu.:0.000  
##  Median :43.5   Median : 9.86   Median :22.9   Median :0.000  
##  Mean   :44.8   Mean   :10.25   Mean   :23.9   Mean   :0.213  
##  3rd Qu.:49.1   3rd Qu.:14.32   3rd Qu.:27.8   3rd Qu.:0.000  
##  Max.   :64.6   Max.   :25.62   Max.   :44.2   Max.   :1.000  
##  NA's   :6      NA's   :6       NA's   :6      NA's   :3      
##      hlegit           elf           hieafvm        hieafvs     
##  Min.   :0.000   Min.   :0.040   Min.   :0.67   Min.   :0.000  
##  1st Qu.:0.330   1st Qu.:0.620   1st Qu.:1.52   1st Qu.:0.000  
##  Median :0.582   Median :0.715   Median :1.84   Median :0.480  
##  Mean   :0.572   Mean   :0.651   Mean   :1.86   Mean   :0.503  
##  3rd Qu.:0.850   3rd Qu.:0.827   3rd Qu.:2.00   3rd Qu.:0.790  
##  Max.   :1.000   Max.   :0.930   Max.   :3.00   Max.   :1.490  
##  NA's   :4       NA's   :12      NA's   :12     NA's   :12     
##      warciv        language    
##  Min.   : 0.0   Min.   : 0.10  
##  1st Qu.: 0.0   1st Qu.: 1.90  
##  Median : 0.0   Median : 4.00  
##  Mean   : 6.2   Mean   : 6.53  
##  3rd Qu.: 8.0   3rd Qu.: 8.30  
##  Max.   :38.0   Max.   :27.70  
##                 NA's   :9

If you ever encounter trouble importing foreign data formats into R, a good option is to use a piece of software called StatTransfer, which can convert between dozens of different file formats. Using StatTransfer to convert a file format into a CSV or R .RData format will essentially guarantee that it is readable by R.

Reading Excel files

Sometimes we need to read data in from Excel. In almost every situation, it is easiest to use Excel to convert this kind of file into a comma-separated CSV file first and then load it into R using read.csv. That said, there are several packages designed to read Excel formats directly, but all have disadvantages.
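
If you do need to read an Excel file directly, one option (just a sketch, assuming the add-on xlsx package is installed and working, which requires Java; the filename is hypothetical) is:

# install.packages("xlsx")  # if not already installed
library(xlsx)
# read the first worksheet of a (hypothetical) Excel file into a dataframe
mydf <- read.xlsx("mydata.xlsx", sheetIndex = 1)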

Notes on other data situations

Sometimes one encounters data in formats that are neither traditional, text-based tabular formats (like CSV or TSV) nor proprietary statistical formats (like .dta, .sav, etc.). For example, you sometimes encounter data that is recorded in an XML markup format or saved in "fixed-width format", and so forth. So long as the data is human-readable (i.e., text), you will be able to find or write R code to deal with these files and convert them to an R dataframe. Depending on the file format, this may be time consuming, but everything is possible.

XML files can easily be read using the XML package. Indeed, its functions xmlToDataFrame and xmlToList easily convert almost any well-formed XML document into a dataframe or list, respectively.

Fixed-width file formats are some of the hardest file formats to deal with. These files, typically built during the 20th Century, are digitized versions of data that were originally stored on punch cards. For example, much of the pre-2000 public opinion data archived at the Roper Center for Public Opinion Research's iPoll databank is stored in fixed-width format. These formats store data as rows of numbers without variable names or value delimiters (like the comma or tab) and require a detailed codebook to translate them into human- or computer-readable data. For example, the following 14 lines represent the first two records of a public opinion data file from 1998:

000003204042898                    248 14816722  1124 13122292122224442 2 522  1
0000032222222444444444444444144444444424424                                    2
000003          2     1    1    2    312922 3112422222121222          42115555 3
00000355554115           553722211212221122222222352   42       4567   4567    4
000003108 41 52 612211                    1                229                 5
000003                                                                         6
000003    20                                                01.900190 0198     7
000012212042898                    248 14828523  1113 1312212111111411142 5213 1
0000122112221111141244412414114224444444144                                    2
000012          1     2    1    2    11212213123112232322113          31213335 3
00001255333115           666722222222221122222226642   72       4567   4567    4
000012101261 511112411                    1                212                 5
000012                                                                         6
000012    32                                                01.630163 0170     7

Clearly, these data are not easily interpretable, despite the fact that there is some obvious pattern to the data. As long as we have a file indicating what each number means, we can use the read.fwf function (from base R) to translate this file into a dataframe. The code is tedious, so there isn't space to demonstrate it fully here, but know that it is possible; a minimal sketch of what such a call looks like follows.
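
Just to show the shape of such a call: the filename, column widths, and column names below are entirely made up for illustration, since a real codebook would dictate them.

# mydf <- read.fwf("rawdata.txt",
#                  widths = c(6, 9, 2, 3),  # characters per field (hypothetical)
#                  col.names = c("id", "date", "age", "weight"))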

Local Regression (LOESS)

Sometimes we have bivariate data that are not well represented by a linear function (even if the variables are transformed). We might be able to see a relationship between the data in a scatterplot, but we are unable to fit a parametric model that properly describes the relationship between outcome and predictor. This might be particularly common when our predictor is a time variable and the outcome is a time-series. In these situations, one way to grasp and convey the relationship is with “local regression,” which fits a nonparametric curve to a scatterplot. Note: Local regression also works in multivariate contexts, but we'll focus on the bivariate form here for sake of simplicity.

Let's create a simple bivariate relationship and a complex one to see how local regression works in both cases.

set.seed(100)
x <- sample(1:50, 100, TRUE)
y1 <- 2 * x + rnorm(50, 0, 10)
y2 <- 5 + rnorm(100, 5 * abs(x - 25), abs(x - 10) + 10)

Fitting and visualizing local regressions

We can fit the local regression using the loess function, which takes a formula object as its argument, just like any other regression:

localfit <- loess(y1 ~ x)

We can look at the summary of the localfit object, but - unlike parametric regression methods - the summary won't tell us much.

summary(localfit)
## Call:
## loess(formula = y1 ~ x)
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 4.51 
## Residual Standard Error: 12 
## Trace of smoother matrix: 4.92 
## 
## Control settings:
##   normalize:  TRUE 
##   span       :  0.75 
##   degree   :  2 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2

Local regression doesn't produce coefficients, so there's no way to see the model in tabular form. Instead we have to look at its predicted values and plot them visually.

We can calculate predicted values at each possible value of x:

localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)

The result is a list containing the predicted values (fit), their standard errors (se.fit), and some additional information:

localp
## $fit
##      1      2      3      4      5      6      7      8      9     10 
##     NA  1.817  3.919  6.025  8.136 10.255 12.383 14.522 16.675 18.843 
##     11     12     13     14     15     16     17     18     19     20 
## 21.035 23.249 25.475 27.699 29.910 32.156 34.459 36.770 39.038 41.213 
##     21     22     23     24     25     26     27     28     29     30 
## 43.297 45.333 47.332 49.304 51.261 53.158 54.964 56.707 58.418 60.124 
##     31     32     33     34     35     36     37     38     39     40 
## 61.856 63.577 65.254 66.916 68.595 70.319 72.120 73.977 75.854 77.754 
##     41     42     43     44     45     46     47     48     49     50 
## 79.680 81.636 83.624 85.649 87.711 89.808 91.937 94.099 96.292 98.515 
## 
## $se.fit
##     1     2     3     4     5     6     7     8     9    10    11    12 
##    NA 5.600 4.910 4.287 3.736 3.261 2.867 2.557 2.331 2.180 2.092 2.048 
##    13    14    15    16    17    18    19    20    21    22    23    24 
## 2.037 2.046 2.066 2.075 2.083 2.123 2.190 2.239 2.230 2.214 2.249 2.327 
##    25    26    27    28    29    30    31    32    33    34    35    36 
## 2.372 2.329 2.252 2.209 2.230 2.289 2.323 2.295 2.249 2.228 2.245 2.278 
##    37    38    39    40    41    42    43    44    45    46    47    48 
## 2.280 2.248 2.208 2.168 2.137 2.128 2.158 2.244 2.409 2.668 3.024 3.475 
##    49    50 
## 4.014 4.634 
## 
## $residual.scale
## [1] 12.04
## 
## $df
## [1] 94.97

To see the loess curve, we can simply plot the fitted values. We'll do something a little more interesting, though. We'll start by plotting our original data (in blue), then plot the standard errors as polygons (using the polygon function) for 1, 2, and 3 SEs, then overlay the fitted loess curve in white. The plot nicely shows the fit to the data and the increasing uncertainty about the conditional mean at the tails of the independent variable. We also see that these data are easily modeled by a linear regression, which we add to the plot as well.

plot(y1 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)), 
    col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit:
abline(lm(y1 ~ x), lwd = 2)

[Plot: scatterplot of y1 against x with 1-, 2-, and 3-SE bands, the loess curve (white), and the linear fit overlaid]

Loess works well in a linear situation, but in those cases we're better off fitting the linear model because then we can get directly interpretable coefficients. The major downside of local regression is that we can only see it and understand it as a graph.

We can repeat the above process for our second outcome, which lacks a clear linear relationship between predictor x and outcome y2:

localfit <- loess(y2 ~ x)
localp <- predict(localfit, data.frame(x = 1:50), se = TRUE)
plot(y2 ~ x, pch = 15, col = rgb(0, 0, 1, 0.5))
# one SE
polygon(c(1:50, 50:1), c(localp$fit - localp$se.fit, rev(localp$fit + localp$se.fit)), 
    col = rgb(1, 0, 0, 0.2), border = NA)
# two SEs
polygon(c(1:50, 50:1), c(localp$fit - 2 * localp$se.fit, rev(localp$fit + 2 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# three SEs
polygon(c(1:50, 50:1), c(localp$fit - 3 * localp$se.fit, rev(localp$fit + 3 * 
    localp$se.fit)), col = rgb(1, 0, 0, 0.2), border = NA)
# loess curve:
lines(1:50, localp$fit, col = "white", lwd = 2)
# overlay a linear fit and associated standard errors:
lmfit <- lm(y2 ~ x)
abline(lmfit, lwd = 2)
lmp <- predict(lmfit, data.frame(x = 1:50), se.fit = TRUE)
lines(1:50, lmp$fit - lmp$se.fit, lty = 2)
lines(1:50, lmp$fit + lmp$se.fit, lty = 2)

[Plot: scatterplot of y2 against x with 1-, 2-, and 3-SE bands, the loess curve (white), and the linear fit with its SE lines overlaid]

In contrast to the data where y1 was a simple function of x, these data are far messier. They are not well-represented by a straight line fit (as evidenced by our overlay of a linear fit to the data). Instead, the local regression approach shows how y2 is not a clean function of the predictor. In these situations, the local regression curve can be helpful for understanding the relationship between outcome and predictor and potentially for building a subsequent parametric model that approximates the data better than a straight line.

Logicals

Logicals are a fundamental tool for using R in a sophisticated way. Logicals allow us to precisely select elements of an R object (e.g., a vector or dataframe) based upon criteria and to selectively perform operations.

R supports all of the typical mathematical comparison operators: Equal to:

1 == 2
## [1] FALSE

Note: Double equals == is a logical test. Single equals = means right-to-left assignment. Greater than:

1 > 2
## [1] FALSE

Greater than or equal to:

1 >= 2
## [1] FALSE

Less than:

1 < 2
## [1] TRUE

Less than or equal to:

1 <= 2
## [1] TRUE

Note: Less than or equal to <= looks like <-, which means right-to-left assignment.

Spacing between the numbers and operators is not important:

1==2
## [1] FALSE
1 == 2
## [1] FALSE

But, spacing between multiple operators is! The following:

# 1 > = 2

produces an error!

The result of any of these comparisons is a logical vector with values TRUE, FALSE, or NA:

is.logical(TRUE)  #' valid logical
## [1] TRUE
is.logical(FALSE)  #' valid logical
## [1] TRUE
is.logical(NA)  #' valid logical
## [1] TRUE
is.logical(45)  #' invalid
## [1] FALSE
is.logical("hello")  #' invalid
## [1] FALSE

Because logicals only take values of TRUE or FALSE, values of 1 or 0 can be coerced to logical:

as.logical(0)
## [1] FALSE
as.logical(1)
## [1] TRUE
as.logical(c(0, 0, 1, 0, NA))
## [1] FALSE FALSE  TRUE FALSE    NA

And, conversely, logicals can be coerced back to integer using mathematical operators:

TRUE + TRUE + FALSE
## [1] 2
FALSE - TRUE
## [1] -1
FALSE + 5
## [1] 5

Logical comparisons can also be applied to vectors:

a <- 1:10
a > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

This produces a logical vector. This is often useful for indexing:

a[a > 5]
## [1]  6  7  8  9 10

We can also apply multiple logical conditions using boolean operators (AND and OR):

a > 4 & a < 9
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
a > 7 | a == 2
##  [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

Complex conditions can also be combined with parentheses to build a logical:

(a > 5 & a < 8) | (a < 3)
##  [1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

There is also a xor function to enforce strict OR (but not AND) logic:

xor(TRUE, FALSE)
## [1] TRUE
xor(TRUE, TRUE)
## [1] FALSE
xor(FALSE, FALSE)
## [1] FALSE

This becomes helpful, for example, if we want to create a new vector based on values of an old vector:

b <- a
b[b > 5] <- 1
b
##  [1] 1 2 3 4 5 1 1 1 1 1

It is also possible to convert a logical vector into a positional vector using which:

a > 5
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
which(a > 5)
## [1]  6  7  8  9 10

Of course, this is only helpful in some contexts because:

a[a > 5]
## [1]  6  7  8  9 10
a[which(a > 5)]
## [1]  6  7  8  9 10

produce the same result.

We can also invert a logical (turn TRUE into FALSE, and vice versa) using the exclamation point (!):

!TRUE
## [1] FALSE
!FALSE
## [1] TRUE
b == 3
##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
!b == 3
##  [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

We can also use an if-else construction to define a new vector conditional on an old vector. For example, we could produce our b vector from above using the ifelse function:

ifelse(a > 5, 1, a)
##  [1] 1 2 3 4 5 1 1 1 1 1

This tests each element of a. If that element meets the condition, it returns the second argument (1); otherwise it returns the corresponding value of a. We could modify this slightly to instead return 2, rather than the original value, when an element fails the condition:

ifelse(a > 5, 1, 2)
##  [1] 2 2 2 2 2 1 1 1 1 1

This gives us an indicator vector.

Set membership

An especially helpful logical operator, %in%, checks whether each element of one vector is also a member of another vector:

d <- 1:5
e <- 4:7
d %in% e
## [1] FALSE FALSE FALSE  TRUE  TRUE
e %in% d
## [1]  TRUE  TRUE FALSE FALSE

R has several other functions related to sets (e.g., union, intersect, setdiff), but these produce numeric output (the elements themselves) rather than logical output, as the example below shows.
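
For example, using the d and e vectors from above:

union(d, e)      # all unique elements: 1 2 3 4 5 6 7
intersect(d, e)  # elements in both:    4 5
setdiff(d, e)    # in d but not in e:   1 2 3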

Vectorization

Note: The ifelse function demonstrates an R feature called “vectorization.” This means that the function operates on each element in the vector rather than having to test each element separately. Many R functions rely on vectorization, which makes them easy to write and fast for the computer to execute.
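
For example, using the a vector from above, arithmetic operators and many functions operate on every element at once, with no explicit loop:

a * 2    # every element doubled
a + a    # elementwise addition
sqrt(a)  # sqrt is applied to each element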

Matrices

Matrices are a two-dimensional data structure that are quite useful, especially for statistics in R. Just like in mathematical notation, an R matrix is an m-by-n grid of elements. To create a matrix, we use the matrix function, which we supply with several parameters including the content of the matrix and its dimensions. If we just give a matrix a data parameter, it produces a column vector:

matrix(1:6)
##      [,1]
## [1,]    1
## [2,]    2
## [3,]    3
## [4,]    4
## [5,]    5
## [6,]    6

If we want the matrix to have different dimensions we can specify nrow and/or ncol parameters:

matrix(1:6, nrow = 2)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(1:6, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
matrix(1:6, nrow = 2, ncol = 3)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

By default, the data are filled into the resulting matrix “column-wise”. If we specify byrow=TRUE, the elements are instead filled in “row-wise”:

matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6

Requesting a matrix smaller than the supplied data parameter will result in only some of the data being used and the rest discarded:

matrix(1:6, nrow = 2, ncol = 1)
##      [,1]
## [1,]    1
## [2,]    2

Note: requesting a matrix with larger dimensions than the data produces a warning:

matrix(1:6, nrow = 2, ncol = 4)
## Warning: data length [6] is not a sub-multiple or multiple of the number
## of columns [4]
##      [,1] [,2] [,3] [,4]
## [1,]    1    3    5    1
## [2,]    2    4    6    2

In this example, we still receive a matrix, but the matrix elements outside of our data are filled in automatically. This process is called “recycling”: R repeats the data until it fills the requested dimensions of the matrix.
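
Recycling is not specific to matrix; it happens whenever R combines vectors of unequal lengths. A quick sketch with ordinary vectors:

c(1, 2, 3, 4) + c(0, 10)  # the shorter vector is recycled to length 4
## [1]  1 12  3 14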

Just as we use length to count the elements in a vector, we can use several functions to measure a matrix object. If we apply length to a matrix, it still counts all the elements in the matrix, but it doesn't tell us about the dimensions:

a <- matrix(1:10, nrow = 2)
length(a)
## [1] 10

If we want to get the number of rows in the matrix, we can use nrow:

nrow(a)
## [1] 2

If we want to get the number of columns in the matrix, we can use ncol:

ncol(a)
## [1] 5

We can also get the number of rows and the number of columns in a single call to dim:

dim(a)
## [1] 2 5

We can also combine (or bind) vectors and/or matrices together using cbind and rbind. rbind is used to “row-bind” by stacking vectors and/or matrices on top of one another vertically. cbind is used to “column-bind” by stacking vectors and/or matrices next to one another horizontally.

rbind(1:3, 4:6, 7:9)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
cbind(1:3, 4:6, 7:9)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

We can also easily transpose a matrix using t:

rbind(1:3, 4:6, 7:9)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
t(rbind(1:3, 4:6, 7:9))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Matrix indexing

Indexing a matrix is very similar to indexing a vector, except now we have to account for two dimensions. The first dimension is rows. The second dimension is columns.

b <- rbind(1:3, 4:6, 7:9)
b[1, ]  #' first row
## [1] 1 2 3
b[, 1]  #' first column
## [1] 1 4 7
b[1, 1]  #' element in first row and first column
## [1] 1

Just as with vector indexing, we can extract multiple elements:

b[1:2, ]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
b[1:2, 2:3]
##      [,1] [,2]
## [1,]    2    3
## [2,]    5    6

And we can also use - indexing:

b[-1, 2:3]
##      [,1] [,2]
## [1,]    5    6
## [2,]    8    9

We can also use logical indexing in the same way:

b[c(TRUE, TRUE, FALSE), ]
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
b[, c(TRUE, FALSE, TRUE)]
##      [,1] [,2]
## [1,]    1    3
## [2,]    4    6
## [3,]    7    9

Diagonal and triangles

It is sometimes helpful to extract the diagonal of a matrix (e.g., the diagonal of a variance-covariance matrix). Diagonals can be extracted using diag:

diag(b)
## [1] 1 5 9

It is also possible to use diag to assign new values to the diagonal of a matrix. For example, we might want to make all of the diagonal elements 0:

b
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
diag(b) <- 0
b
##      [,1] [,2] [,3]
## [1,]    0    2    3
## [2,]    4    0    6
## [3,]    7    8    0

We can also extract the upper or lower triangle of a matrix (e.g., to extract one half of a correlation matrix). upper.tri and lower.tri produce logical matrices of the same dimensions as the original matrix, which can then be used for indexing:

upper.tri(b)  #' upper triangle
##       [,1]  [,2]  [,3]
## [1,] FALSE  TRUE  TRUE
## [2,] FALSE FALSE  TRUE
## [3,] FALSE FALSE FALSE
b[upper.tri(b)]
## [1] 2 3 6
lower.tri(b)  #' lower triangle
##       [,1]  [,2]  [,3]
## [1,] FALSE FALSE FALSE
## [2,]  TRUE FALSE FALSE
## [3,]  TRUE  TRUE FALSE
b[lower.tri(b)]
## [1] 4 7 8

Matrix names

Recall that vectors can have named elements. Matrices can have named dimensions. Each row and column of a matrix can have a name that is supplied when it is created or added/modified later.

c <- matrix(1:6, nrow = 2)

Row names are added with rownames:

rownames(c) <- c("Row1", "Row2")

Column names are added with colnames:

colnames(c) <- c("x", "y", "z")

Dimension names can also be added initially when the matrix is created using the dimnames parameter in matrix:

matrix(1:6, nrow = 2, dimnames = list(c("Row1", "Row2"), c("x", "y", "z")))
##      x y z
## Row1 1 3 5
## Row2 2 4 6

Dimension names can also be created in this way for only the rows or columns by using a NULL value for one of the dimensions:

matrix(1:6, nrow = 2, dimnames = list(c("Row1", "Row2"), NULL))
##      [,1] [,2] [,3]
## Row1    1    3    5
## Row2    2    4    6
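
Once dimension names are in place, they can be used for indexing just like numeric positions; a brief sketch using the named matrix c from above:

c["Row1", "y"]  # element in the row named Row1 and the column named y
## [1] 3
c[, "z"]  # the column named z, returned as a named vector
## Row1 Row2
##    5    6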

Matrix algebra

Scalar addition/subtraction

Scalar addition and subtraction on a matrix works identically to addition or subtraction on a vector. We simply use the standard addition (+) and subtraction (-) operators.

a <- matrix(1:6, nrow = 2)
a
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
a + 1
##      [,1] [,2] [,3]
## [1,]    2    4    6
## [2,]    3    5    7
a - 2
##      [,1] [,2] [,3]
## [1,]   -1    1    3
## [2,]    0    2    4

Scalar multiplication/division

Scalar multiplication and division also work with the standard operators (* and /).

a * 2
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12
a/2
##      [,1] [,2] [,3]
## [1,]  0.5  1.5  2.5
## [2,]  1.0  2.0  3.0

Matrix comparators, logicals, and assignment

As with a vector, it is possible to apply comparators to an entire matrix:

a > 2
##       [,1] [,2] [,3]
## [1,] FALSE TRUE TRUE
## [2,] FALSE TRUE TRUE

We can then use the resulting logical matrix as an index:

a[a > 2]
## [1] 3 4 5 6

But the result is a vector, not a matrix. If we use the same statement to assign, however, the result is a matrix:

a[a > 2] <- 99
a
##      [,1] [,2] [,3]
## [1,]    1   99   99
## [2,]    2   99   99

Matrix Multiplication

In statistics, an important operation is matrix multiplication. Unlike scalar multiplication, this procedure involves the multiplication of two matrices by one another.

Let's start by defining a function to demonstrate how matrix multiplication works:

mmdemo <- function(A, B) {
    m <- nrow(A)  # rows of the result come from A
    n <- ncol(B)  # columns of the result come from B
    C <- matrix(NA, nrow = m, ncol = n)
    for (i in 1:m) {
        for (j in 1:n) {
            # spell out the sum of element-wise products that produces C[i, j]
            C[i, j] <- paste("(", A[i, ], "*", B[, j], ")", sep = "", collapse = "+")
        }
    }
    print(C, quote = FALSE)
}

Now let's generate two matrices, multiply them and see how it worked:

amat <- matrix(1:4, ncol = 2)
bmat <- matrix(1:6, nrow = 2)
amat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
bmat
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
amat %*% bmat
##      [,1] [,2] [,3]
## [1,]    7   15   23
## [2,]   10   22   34
mmdemo(amat, bmat)
##      [,1]        [,2]        [,3]       
## [1,] (1*1)+(3*2) (1*3)+(3*4) (1*5)+(3*6)
## [2,] (2*1)+(4*2) (2*3)+(4*4) (2*5)+(4*6)

Let's try it on a different set of matrices:

amat <- matrix(1:16, ncol = 4)
bmat <- matrix(1:32, nrow = 4)
amat
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
bmat
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    5    9   13   17   21   25   29
## [2,]    2    6   10   14   18   22   26   30
## [3,]    3    7   11   15   19   23   27   31
## [4,]    4    8   12   16   20   24   28   32
amat %*% bmat
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]   90  202  314  426  538  650  762  874
## [2,]  100  228  356  484  612  740  868  996
## [3,]  110  254  398  542  686  830  974 1118
## [4,]  120  280  440  600  760  920 1080 1240
mmdemo(amat, bmat)
##      [,1]                      [,2]                     
## [1,] (1*1)+(5*2)+(9*3)+(13*4)  (1*5)+(5*6)+(9*7)+(13*8) 
## [2,] (2*1)+(6*2)+(10*3)+(14*4) (2*5)+(6*6)+(10*7)+(14*8)
## [3,] (3*1)+(7*2)+(11*3)+(15*4) (3*5)+(7*6)+(11*7)+(15*8)
## [4,] (4*1)+(8*2)+(12*3)+(16*4) (4*5)+(8*6)+(12*7)+(16*8)
##      [,3]                         [,4]                         
## [1,] (1*9)+(5*10)+(9*11)+(13*12)  (1*13)+(5*14)+(9*15)+(13*16) 
## [2,] (2*9)+(6*10)+(10*11)+(14*12) (2*13)+(6*14)+(10*15)+(14*16)
## [3,] (3*9)+(7*10)+(11*11)+(15*12) (3*13)+(7*14)+(11*15)+(15*16)
## [4,] (4*9)+(8*10)+(12*11)+(16*12) (4*13)+(8*14)+(12*15)+(16*16)
##      [,5]                          [,6]                         
## [1,] (1*17)+(5*18)+(9*19)+(13*20)  (1*21)+(5*22)+(9*23)+(13*24) 
## [2,] (2*17)+(6*18)+(10*19)+(14*20) (2*21)+(6*22)+(10*23)+(14*24)
## [3,] (3*17)+(7*18)+(11*19)+(15*20) (3*21)+(7*22)+(11*23)+(15*24)
## [4,] (4*17)+(8*18)+(12*19)+(16*20) (4*21)+(8*22)+(12*23)+(16*24)
##      [,7]                          [,8]                         
## [1,] (1*25)+(5*26)+(9*27)+(13*28)  (1*29)+(5*30)+(9*31)+(13*32) 
## [2,] (2*25)+(6*26)+(10*27)+(14*28) (2*29)+(6*30)+(10*31)+(14*32)
## [3,] (3*25)+(7*26)+(11*27)+(15*28) (3*29)+(7*30)+(11*31)+(15*32)
## [4,] (4*25)+(8*26)+(12*27)+(16*28) (4*29)+(8*30)+(12*31)+(16*32)

Note: matrix multiplication is noncommutative, so the order of matrices matters in a statement!

Cross-product

Another important operation is the crossproduct. See also: OLS in matrix form.
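
Briefly, the crossproduct t(X) %*% X can be computed directly with the base crossprod function (and X %*% t(X) with tcrossprod); a small sketch:

X <- matrix(1:6, ncol = 2)
t(X) %*% X
##      [,1] [,2]
## [1,]   14   32
## [2,]   32   77
crossprod(X)  # same result, computed somewhat more efficiently
##      [,1] [,2]
## [1,]   14   32
## [2,]   32   77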

Row/column means and sums

Sometimes we want to calculate a sum or mean for each row or column of a matrix. R provides built-in functions for each of these operations:

cmat <- matrix(1:20, nrow = 5)
cmat
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20
rowSums(cmat)
## [1] 34 38 42 46 50
colSums(cmat)
## [1] 15 40 65 90
rowMeans(cmat)
## [1]  8.5  9.5 10.5 11.5 12.5
colMeans(cmat)
## [1]  3  8 13 18

These functions are helpful for aggregating across multiple variables, and calculating a sum or mean this way is much faster than manually adding columns (or taking their mean) with the + and / operators.
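
The same quantities can also be computed with the more general apply function, though the specialized functions above are usually faster; a quick sketch using cmat:

apply(cmat, 1, sum)  # equivalent to rowSums(cmat)
## [1] 34 38 42 46 50
apply(cmat, 2, mean)  # equivalent to colMeans(cmat)
## [1]  3  8 13 18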

Missing data

Missing data values in R are a major point of confusion. This tutorial walks through some of the basics of missing data. Whereas some statistical packages have several different kinds of missing data, R has only one:

NA
## [1] NA

NA means a missing value. For example, in a vector variable, we might be missing the third observation:

a <- c(1, 2, NA, 4, 5)
a
## [1]  1  2 NA  4  5

This impacts our ability to do calculations on the vector, like taking its sum:

sum(a)
## [1] NA

This is because R treats anything mathematically calculated with an NA as missing:

1 + NA
## [1] NA
0 + NA
## [1] NA

This can cause some confusion because many statistical packages omit missing values by default. The R approach is better because it forces you to be conscious about where data are missing.

Another point of confusion is that some things look like missing data but are not. For example, the NULL value is not missing. Note the difference between a and b:

a
## [1]  1  2 NA  4  5
b <- c(1, 2, NULL, 4, 5)
b
## [1] 1 2 4 5

b has only four elements. NULL is not missing, it is simply dropped.

This can be especially confusing when a vector is of character class. For example, compare c to d:

c <- c("do", "re", NA, "fa")
c
## [1] "do" "re" NA   "fa"
d <- c("do", "re", "NA", "fa")
d
## [1] "do" "re" "NA" "fa"

The third element of c is missing (NA), whereas the third element of d is a character string 'NA'. We can see this with the logical test is.na:

is.na(c)
## [1] FALSE FALSE  TRUE FALSE
is.na(d)
## [1] FALSE FALSE FALSE FALSE

This tests whether each element in a vector is missing. Similarly, an empty character string is not missing:

is.na("")
## [1] FALSE

It is simply a character string that has no contents. For example, compare c to e:

c
## [1] "do" "re" NA   "fa"
e <- c("do", "re", "", "fa")
e
## [1] "do" "re" ""   "fa"
is.na(c)
## [1] FALSE FALSE  TRUE FALSE
is.na(e)
## [1] FALSE FALSE FALSE FALSE

There may be situations in which we want to change missing NA values or remove them entirely. For example, to change all NA values in a vector to 0, we could use logical indexing:

f <- c(1, 2, NA, NA, NA, 6, 7)
f
## [1]  1  2 NA NA NA  6  7
f[is.na(f)] <- 0
f
## [1] 1 2 0 0 0 6 7

Alternatively, there may be situations where we want to convert NA values to NULL values (i.e., drop them entirely), and thus shorten our vector:

g1 <- c(1, 2, NA, NA, NA, 6, 7)
g2 <- na.omit(g1)
g2
## [1] 1 2 6 7
## attr(,"na.action")
## [1] 3 4 5
## attr(,"class")
## [1] "omit"

We now have a shorter vector:

length(g1)
## [1] 7
length(g2)
## [1] 4

But that vector has been given an additional attribute: a vector of positions of omitted missing values:

attributes(g2)$na.action
## [1] 3 4 5
## attr(,"class")
## [1] "omit"

Many functions also provide the ability to exclude missing values from a calculation. For example, to calculate the sum of g1 we could either use the na.omit function or an na.rm parameter in sum:

sum(na.omit(g1))
## [1] 16
sum(g1, na.rm = TRUE)
## [1] 16

Both provide the same answer. Many functions in R allow an na.rm parameter (or something similar).

Missing data handling

Missing data is a pain. It creates problems for simple and complicated analyses. It also tends to undermine our ability to make valid inferences. Most statistical packages tend to “brush missing data under the rug” and simply delete missing cases on the fly. This is nice because it makes analysis simple: e.g., if you want the mean of a variable with missing data, most packages drop the missing data and report the mean of the remaining values. But a different view is also credible: the assumption that we should discard missing values may be a bad assumption. For example, let's say that we want to build a regression model to explain two outcomes but those outcome variables have different patterns of missing data. If we engage in “on-the-fly” case deletion, then we end up with two models that are built on different, non-comparable subsets of the original data. We are then limited in our ability to compare, e.g., the coefficients from one model to the other because they are estimated on different data. Choosing how to deal with missing values is thus better done as an intentional activity early in the process of data analysis rather than as an analysis-specific assumption.

This tutorial demonstrates some basic missing data handling procedures. A separate tutorial on multiple imputation covers advanced techniques.

Local NA handling

When R encounters missing data, its typical behavior is to attempt to perform the requested procedure and then return a missing (NA) value as the result. We can see this if we attempt to calculate the mean of a vector containing missing data:

x <- c(1, 2, 3, NA, 5, 7, 9)
mean(x)
## [1] NA

R is telling us here that our vector contains missing data, so the requested statistic - the mean - is undefined for these data. If we want to do what many statistical packages do by default and calculate the mean after dropping the missing value, we just need to request that R remove the missing values using the na.rm=TRUE argument:

mean(x, na.rm = TRUE)
## [1] 4.5

na.rm can be found in many R functions, such as mean, median, sd, var, and so forth. One exception to this is the summary function when applied to a vector of data. By default it counts missing values and then reports the mean, median, and other statistics excluding those values:

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.25    4.00    4.50    6.50    9.00       1

Another common function that handles missing values atypically is the correlation (cor) function. Rather than accepting an na.rm argument, it has a use argument that specifies what set of cases to use when calculating the correlation coefficient. Its default behavior - like mean, median, etc. - is to attempt to calculate the correlation coefficient with use="everything". This can result in an NA result:

y <- c(3, 2, 4, 5, 1, 3, 4)
cor(x, y)
## [1] NA

The use argument can take several values (see ?cor), but the two most useful are use="complete.obs" and use="pairwise.complete.obs". The former deletes all cases with missing values before calculating the correlation. The latter applies when building a correlation matrix (i.e., correlations between more than two variables): instead of dropping all cases with any missing data, it only drops cases from each pairwise correlation calculation. We can see this if we build a three-variable matrix:

z <- c(NA, 2, 3, 5, 4, 3, 4)
m <- data.frame(x, y, z)
m
##    x y  z
## 1  1 3 NA
## 2  2 2  2
## 3  3 4  3
## 4 NA 5  5
## 5  5 1  4
## 6  7 3  3
## 7  9 4  4
cor(m)  # returns all NAs
##    x  y  z
## x  1 NA NA
## y NA  1 NA
## z NA NA  1
cor(m, use = "complete.obs")
##        x       y       z
## x 1.0000 0.34819 0.70957
## y 0.3482 1.00000 0.04583
## z 0.7096 0.04583 1.00000
cor(m, use = "pairwise.complete.obs")
##        x      y      z
## x 1.0000 0.2498 0.7096
## y 0.2498 1.0000 0.4534
## z 0.7096 0.4534 1.0000

Under default settings, the result is a matrix with NA for every off-diagonal correlation. With use="complete.obs", all cases with missing values are first removed from m and then the correlation matrix is produced. With use="pairwise.complete.obs", cases with missing values are only removed during the calculation of each pairwise correlation. Thus we see that the correlation between x and z is the same in both matrices, but the correlation between y and each of x and z depends on the use method (with dramatic effect).

Regression NA handling

Another place where missing data are handled atypically is regression modeling. If we estimate a linear regression model with our x, z, and y data, R will default to casewise deletion. We can see this here:

lm <- lm(y ~ x + z, data = m)
summary(lm)
## 
## Call:
## lm(formula = y ~ x + z, data = m)
## 
## Residuals:
##      2      3      5      6      7 
## -0.632  1.711 -1.237 -0.447  0.605 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    3.316      3.399    0.98     0.43
## x              0.289      0.408    0.71     0.55
## z             -0.632      1.396   -0.45     0.70
## 
## Residual standard error: 1.65 on 2 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.203,  Adjusted R-squared:  -0.594 
## F-statistic: 0.254 on 2 and 2 DF,  p-value: 0.797

The model, obviously, can only be fit to the available data, so the resulting fitted values have a different length than the original data:

length(m$y)
## [1] 7
length(lm$fitted)
## [1] 5

Thus, if we tried to store our fitted values back into our m dataframe (e.g., using m$fitted <- lm$fitted) or plot our model residuals against the original outcome y (e.g., with plot(lm$residuals ~ m$y)), we would encounter an error. This is typical of statistical packages, but highlights that we should really address missing data before we start any of our analysis.
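
If we do want fitted values and residuals padded back out to the length of the original data, one option is to fit the model with na.action = na.exclude; a brief sketch:

lm2 <- lm(y ~ x + z, data = m, na.action = na.exclude)
length(fitted(lm2))  # fitted values are padded with NA for the dropped cases
## [1] 7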

Global NA handling

How do we deal with missing data globally? Basically, we need to decide how we're going to use our missing data, if at all, then either remove cases from our data or impute missing values, and then proceed with our analysis. As mentioned, one strategy is multiple imputation, which is addressed in a separate tutorial. Before we deal with missing data, it is helpful to know where it lies in our data. We can look for missing data in a vector by simply wrapping the vector in is.na:

is.na(x)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

We can also do the same for an entire dataframe:

is.na(m)
##          x     y     z
## [1,] FALSE FALSE  TRUE
## [2,] FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE
## [5,] FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE

That works fine in our small example, but in a very large dataset, that could get quite difficult to understand. Therefore, it is helpful to visualize missing data in a plot. We can use the image function to visualize the is.na(m) matrix:

image(is.na(m), main = "Missing Values", xlab = "Observation", ylab = "Variable", 
    xaxt = "n", yaxt = "n", bty = "n")
axis(1, seq(0, 1, length.out = nrow(m)), 1:nrow(m), col = "white")
axis(2, c(0, 0.5, 1), names(m), col = "white", las = 2)

[plot: image map of missing values in m, by observation and variable]

Note: The syntax here is a little bit tricky, but it is simply intended to make the plot easier to read. See ?image for more details. The plot shows we have two missing values: one in our z variable for observation 1 and one in our x variable for observation 4. This plot can help us understand where our missing data are and whether we systematically observe missing data for certain types of observations.

Once we know where our missing data are, we can deal with them in some way. Casewise deletion is the easiest way to deal with missing data. It simply removes all cases that have missing data anywhere in the data. To do casewise deletion, we simply use the na.omit function on our entire dataframe:

na.omit(m)
##   x y z
## 2 2 2 2
## 3 3 4 3
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4

In our example data, this procedure removes two rows that contain missing values. Note: using na.omit(m) does not affect our original object m. To use the new dataframe, we need to save it as an object:

m2 <- na.omit(m)

This lets us easily go back to our original data:

m
##    x y  z
## 1  1 3 NA
## 2  2 2  2
## 3  3 4  3
## 4 NA 5  5
## 5  5 1  4
## 6  7 3  3
## 7  9 4  4
m2
##   x y z
## 2 2 2 2
## 3 3 4 3
## 5 5 1 4
## 6 7 3 3
## 7 9 4 4

Another strategy is some kind of imputation. There are an endless number of options here - and the best way is probably multiple imputation, which is described elsewhere - but two ways to do simple, single imputation are to replace missing values with the mean of the other values in the variable or to randomly sample from those values. The former approach (mean imputation) preserves the mean of the variable, whereas the latter approach (random imputation) preserves both the mean and variance. Both might be unreasonable, but it's worth seeing how to do them. To do mean imputation we simply need to identify missing values, calculate the mean of the remaining values, and store that mean in those missing value positions:

x2 <- x
x2
## [1]  1  2  3 NA  5  7  9
is.na(x2)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
x2[is.na(x2)]
## [1] NA
mean(x2, na.rm = TRUE)
## [1] 4.5
x2[is.na(x2)] <- mean(x2, na.rm = TRUE)
x2
## [1] 1.0 2.0 3.0 4.5 5.0 7.0 9.0

To do random imputation is a bit more complicated because we need to sample the non-missing values with the sample function, but the process is otherwise similar:

x3 <- x
x3[!is.na(x3)]  # values from which we can sample
## [1] 1 2 3 5 7 9
x3[is.na(x3)] <- sample(x3[!is.na(x3)], sum(is.na(x3)), TRUE)
x3
## [1] 1 2 3 5 5 7 9

Thus these two imputation strategies produce different resulting data (and those data will reflect the statistical properties of the original data to varying extents), but they mean that all subsequent analysis will not have to worry about missing values.

Model Formulae

One of the most important object classes for statistics in R is the “formula” class. Formula objects, while unimportant for R in general, are critical to many statistical tests and statistical plots in R (as well as many add-on packages). Formulae convey a relationship among a set of variables in a simple, intuitive way. They are also data-independent, meaning that a formula can be constructed and then applied to different dataframes or subsets of a dataframe. This means we can define formulae without having any data loaded. Note: We did not discuss formulae in the tutorials on object classes because they are not one of the fundamental classes needed throughout R. They are only needed for statistical procedures, which we happen to need a lot in academic research but which aren't as critical in other uses of R.

Formula basics

The basic structure of a formula is the tilde symbol (~) and at least one independent (righthand) variable. In most (but not all) situations, a single dependent (lefthand) variable is also needed. Thus we can construct a formula quite simply by just typing:

~x
## ~x
## <environment: 0x000000001c3d67b0>

Note: Spaces in formulae are not important. And, like any other object, we can store this as an R variable and see that it is, in fact, a formula:

myformula <- ~x
class(myformula)
## [1] "formula"

More commonly, we want to express a formula as a relationship between an outcome (lefthand) variable and one or more independent/predictor/covariate (righthand) variables:

myformula <- y ~ x

We can use multiple independent variables by simply separating them with the plus (+) symbol:

y ~ x1 + x2
## y ~ x1 + x2
## <environment: 0x000000001c3d67b0>

If we use a minus (-) symbol, objects in the formula are ignored in an analysis:

y ~ x1 - x2
## y ~ x1 - x2
## <environment: 0x000000001c3d67b0>

One particularly helpful feature when modelling with lots of variables is the . operator. When used in a formula, . refers to all other variables in the supplied data not yet included in the model. So, if we plan to run a regression on a matrix (or dataframe) containing the variables y, x1, z3, and areallylongvariablename, we can simply use the formula:

y ~ .
## y ~ .
## <environment: 0x000000001c3d67b0>

and avoid having to type all of the variables.

Interaction terms

In a regression modeling context, we often need to specify interaction terms. There are two ways to do this. If we want to include two variables and their interaction, we use the star/asterisk (*) symbol:

y ~ x1 * x2
## y ~ x1 * x2
## <environment: 0x000000001c3d67b0>

If we only want their interaction, but not the variables themselves, we use the colon (:) symbol:

y ~ x1:x2
## y ~ x1:x2
## <environment: 0x000000001c3d67b0>

Note: We probably don't want to include an interaction without its constituent terms. Because * expands to the constituent terms plus their interaction, some formulae that look different are actually equivalent. The following formulae will produce the same regression:

y ~ x1 * x2
## y ~ x1 * x2
## <environment: 0x000000001c3d67b0>
y ~ x1 + x2 + x1:x2
## y ~ x1 + x2 + x1:x2
## <environment: 0x000000001c3d67b0>

Regression formulae

In regression models, we may also want to know a few other tricks. One trick is to drop the intercept, by either including a zero (0) or a minus-one (-1) in the formula:

y ~ -1 + x1 * x2
## y ~ -1 + x1 * x2
## <environment: 0x000000001c3d67b0>
y ~ 0 + x1 * x2
## y ~ 0 + x1 * x2
## <environment: 0x000000001c3d67b0>

We can also offset the intercept of a model using the offset function. The use is kind of strange and not that common, but we can increase the intercept by, e.g., 2 using:

y ~ x1 + offset(rep(-2, n))
## y ~ x1 + offset(rep(-2, n))
## <environment: 0x000000001c3d67b0>

or reduce the intercept by, e.g., 3 using:

y ~ x1 + offset(rep(3, n))
## y ~ x1 + offset(rep(3, n))
## <environment: 0x000000001c3d67b0>

Note: The n here would have to be tailored to the number of observations in the actual data. It's unclear in what context this functionality is really helpful, but it does mean that models can be adjusted in fairly sophisticated ways.

Factor variables

An important consideration in regression formulae is the handling of factor-class variables. When a factor is included in a regression model, it is automatically converted into a series of indicator (“dummy”) variables, with the factor's first level treated as a baseline. This also means that we can convert non-factor variables into a series of dummies, simply by wrapping them in factor:

y ~ x
## y ~ x
## <environment: 0x000000001c3d67b0>
# to:
y ~ factor(x)
## y ~ factor(x)
## <environment: 0x000000001c3d67b0>
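
To see exactly which indicator variables R constructs from a factor, we can inspect the design matrix directly with model.matrix; a small sketch using a toy factor g:

g <- factor(c("a", "b", "b", "c"))
model.matrix(~g)  # columns gb and gc are 0/1 indicators; level "a" is the baseline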

As-is variables

One trick to formulas is that they don't evaluate their contents. So, for example, if we wanted to include x and x^2 in our model, we might intuit that we should type:

y ~ x + x^2
## y ~ x + x^2
## <environment: 0x000000001c3d67b0>

If we attempted to estimate a regression model using this formula, R would drop the x^2 term because it thinks it is a duplicate of x. We therefore have to either calculate and store all of the variables we want to include in the model in advance, or we need to use the I() “as-is” operator. To obtain our desired two-term formula, we could use I() as follows:

y ~ x + I(x^2)
## y ~ x + I(x^2)
## <environment: 0x000000001c3d67b0>

This tells R to calculate the values of x^2 before attempting to use the formula. Aside from calculating powers, I() can also be helpful when we want to rescale a variable for a model (e.g., to make two coefficients more comparable by using a common scale). Again, we simply wrap the relevant variable name in I():

y ~ I(2 * x)
## y ~ I(2 * x)
## <environment: 0x000000001c3d67b0>

This formula would, in a linear regression, produce a coefficient half as large as the model for y~x.

Formulae as character strings

One might be tempted to compare a formula to a character string. They look similar, but they are different. Their similarity means, however, that a character string containing a formula can often be used where a formula-class object is required. Indeed, the following is true:

("y ~ x") == (y ~ x)
## [1] TRUE

And we can easily convert between formula and character class:

as.formula("y~x")
## y ~ x
## <environment: 0x000000001c3d67b0>
as.character(y ~ x)
## [1] "~" "y" "x"

Note: The result of the latter is probably not what you expected, but it relates to how formulae are indexed:

(y ~ x)[1]
## `~`()
## <environment: 0x000000001c3d67b0>
(y ~ x)[2]
## y()
(y ~ x)[3]
## x()

The ability to easily transform between formula and character class means that we can also build formulae on the fly using paste. For example, if we want to add righthand variables to a formula, we can simply paste them:

paste("y~x", "x2", "x3", sep = "+")
## [1] "y~x+x2+x3"
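
To actually use such a string where a formula is required, we can convert it with as.formula; a minimal sketch:

f <- as.formula(paste("y~x", "x2", "x3", sep = "+"))
class(f)
## [1] "formula"

The resulting formula can then be passed to any function that expects a formula (e.g., lm).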

Advanced formula manipulation

One of the really nice features of formulae is that they have many methods. For example, we can use the terms function to examine and compare different formulae:

terms(y ~ x1 + x2)
## y ~ x1 + x2
## attr(,"variables")
## list(y, x1, x2)
## attr(,"factors")
##    x1 x2
## y   0  0
## x1  1  0
## x2  0  1
## attr(,"term.labels")
## [1] "x1" "x2"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>
terms(y ~ 0 + x1)
## y ~ 0 + x1
## attr(,"variables")
## list(y, x1)
## attr(,"factors")
##    x1
## y   0
## x1  1
## attr(,"term.labels")
## [1] "x1"
## attr(,"order")
## [1] 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>
terms(~x1 + x2)
## ~x1 + x2
## attr(,"variables")
## list(x1, x2)
## attr(,"factors")
##    x1 x2
## x1  1  0
## x2  0  1
## attr(,"term.labels")
## [1] "x1" "x2"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 0
## attr(,".Environment")
## <environment: 0x000000001c3d67b0>

The output above shows the formula itself, a list of its constituent variables, the presence of an intercept, the presence of a response, and so forth. If we just want to know the names of the variables in the model, we can use all.vars:

all.vars(y ~ x1 + x2)
## [1] "y"  "x1" "x2"

We can also modify formulae without converting them to character (as we did above), using the update function. This potentially saves a lot of typing:

update(y ~ x, ~. + x2)
## y ~ x + x2
## <environment: 0x000000001c3d67b0>
update(y ~ x, z ~ .)
## z ~ x
## <environment: 0x000000001c3d67b0>

This could be used, e.g., to run a “small” model and then a larger version of it:

myformula <- y ~ a + b + c
update(myformula, "~.+d+e+f")
## y ~ a + b + c + d + e + f
## <environment: 0x000000001c3d67b0>

Or to use the same righthand variables to predict two different outcomes:

update(myformula, "z~.")
## z ~ a + b + c
## <environment: 0x000000001c3d67b0>

We can also drop terms using update:

update(myformula, "~.-a")
## y ~ b + c
## <environment: 0x000000001c3d67b0>

Multinomial Outcome Models

One important, but sometimes problematic, class of regression models deals with nominal or multinomial outcomes (i.e., outcomes that are not continuous or even ordered). These models cannot be estimated with glm, but they can be estimated with the nnet add-on package, which is a recommended package and therefore simply needs to be loaded. Let's start by loading the package, or installing and then loading it if it isn't already on our system:

install.packages("nnet", repos = "http://cran.r-project.org")
## Warning: package 'nnet' is in use and will not be installed
library(nnet)

Then let's create some simple bivariate data where the outcome y takes three values:

set.seed(100)
y <- sort(sample(1:3, 600, TRUE))
x <- numeric(length = 600)
x[1:200] <- -1 * x[1:200] + rnorm(200, 4, 2)
x[201:400] <- 1 * x[201:400] + rnorm(200)
x[401:600] <- 2 * x[401:600] + rnorm(200, 2, 2)

We can plot the data to see what's going on:

plot(y ~ x, col = rgb(0, 0, 0, 0.3), pch = 19)
abline(lm(y ~ x), col = "red")  # a badly fitted regression line

[plot: y against x with a poorly fitting linear regression line]

Clearly, there is a relationship between x and y, but it's certainly not linear, and if we tried to draw a line through the data (i.e., the straight red regression line), many of the predicted values would be problematic because y can only take on the discrete values 1, 2, and 3; in fact, the line hardly fits the data at all. We might therefore rely on a multinomial model, which will give us the coefficients for x for each level of the outcome. In other words, the coefficients from a multinomial logistic model express effects in terms of moving from the baseline category of the outcome to the other levels of the outcome (essentially combining several binary logistic regression models into a single model). Let's look at the output from the multinom function to see what these results look like:

m1 <- multinom(y ~ x)
## # weights:  9 (4 variable)
## initial  value 659.167373 
## iter  10 value 535.823756
## iter  10 value 535.823754
## final  value 535.823754 
## converged
summary(m1)
## Call:
## multinom(formula = y ~ x)
## 
## Coefficients:
##   (Intercept)       x
## 2       1.849 -0.8620
## 3       1.126 -0.3208
## 
## Std. Errors:
##   (Intercept)       x
## 2      0.1900 0.07096
## 3      0.1935 0.05141
## 
## Residual Deviance: 1072 
## AIC: 1080

Our model consists of only one covariate, but we now see two intercept coefficients and two slope coefficients because the model is telling us the relationship between x and y in terms of moving from category 1 to category 2 of y and from category 1 to category 3 of y, respectively. The standard errors are printed below the coefficients. Unfortunately, it's almost impossible to interpret the coefficients here because a unit change in x has some kind of negative impact on both higher levels of y, but we don't know how much.

Predicted values from multinomial models

A better way to examine the effects in a multinomial model is to look at predicted probabilities. We need to start with some new data representing the full scale of the x variable:

newdata <- data.frame(x = seq(min(x), max(x), length.out = 100))

Like with binary models, we can extract different kinds of predictions from the model using predict. The first type of prediction is simply the fitted “class” or level of the outcome:

p1 <- predict(m1, newdata, type = "class")

The second is a predicted probability of being in each category of y. In other words, for each value of our new data, predict with type="probs" will return, in our example, three predicted probabilities.

p2 <- predict(m1, newdata, type = "probs")

These probabilities also sum to 1 for each observation, which means that the model requires the categories of y to be mutually exclusive and exhaustive. There's no opportunity for x to predict a value outside of those included in the model. You can verify this using rowSums:

rowSums(p2)
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  91  92  93  94  95  96  97  98  99 100 
##   1   1   1   1   1   1   1   1   1   1

If you want to relax this constraint, you can separately model your data using two or more binary logistic regressions comparing different categories. For example, we could model the data predicting y==2 against y==1 and, separately, y==3 against y==1. To do this, we can create two subsets of our original data, one containing the y==1 and y==2 cases and one containing the y==1 and y==3 cases.

df1 <- data.frame(x = x, y = y)[y %in% c(1, 2), ]
df1$y <- df1$y - 1  # recode 2 to 1 and 1 to 0
df2 <- data.frame(x = x, y = y)[y %in% c(1, 3), ]
df2$y[df2$y == 1] <- 0  # recode 1 to 0
df2$y[df2$y == 3] <- 1  # recode 3 to 1

We can then model this and compare to the coefficients from the multinomial model:

coef(glm(y ~ x, data = df1, family = binomial))  # predict 2 against 1
## (Intercept)           x 
##      1.8428     -0.8839
coef(glm(y ~ x, data = df2, family = binomial))  # predict 3 against 1
## (Intercept)           x 
##      1.1102     -0.3144
coef(m1)  # multinomial model
##   (Intercept)       x
## 2       1.849 -0.8620
## 3       1.126 -0.3208

Clearly, the coefficients from the two modeling strategies are similar, but not identical. The multinomial model probably imposes a more plausible assumption (that the predicted probabilities sum to 1), but you can easily try both approaches.

Plotting predicted classes

One way to visualize the results of a multinomial model is simply to plot our fitted values for y on top of our original data:

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 19)
lines(newdata$x, p1, col = rgb(1, 0, 0, 0.75), lwd = 5)

[plot: data with predicted classes drawn as a single connected line]

This plot shows that as we increase along x, observations are first likely to be in y==2, then y==3, and finally y==1. Unfortunately, using the lines function gives a slight misrepresentation because of the vertical discontinuities. We can draw three separate lines to have a more accurate picture:

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 19)
lines(newdata$x[p1 == 1], p1[p1 == 1], col = "red", lwd = 5)
lines(newdata$x[p1 == 2], p1[p1 == 2], col = "red", lwd = 5)
lines(newdata$x[p1 == 3], p1[p1 == 3], col = "red", lwd = 5)

[plot: data with predicted classes drawn as three separate line segments]

Plotting predicted probabilities

Plotting fitted values is helpful, but doesn't give us a sense of uncertainty. Obviously the red lines in the previous plots show the category that we are most likely to observe for a given value of x, but they don't show us how likely an observation is to be in the other categories. To see that, we need to look at predicted probabilities. Let's start by looking at the predicted probabilities object p2:

head(p2)
##         1      2       3
## 1 0.01083 0.9021 0.08702
## 2 0.01210 0.8949 0.09298
## 3 0.01350 0.8872 0.09929
## 4 0.01506 0.8790 0.10596
## 5 0.01678 0.8702 0.11300
## 6 0.01868 0.8609 0.12041

As stated above, this object contains a predicted probability of being in each category of y for a given value of x. The simplest plot of this is three lines, each of which is color coded to represent categories 1, 2, and 3 of y, respectively:

plot(NA, xlim = c(min(x), max(x)), ylim = c(0, 1), xlab = "x", ylab = "Predicted Probability")
lines(newdata$x, p2[, 1], col = "red", lwd = 2)
lines(newdata$x, p2[, 2], col = "blue", lwd = 2)
lines(newdata$x, p2[, 3], col = "green", lwd = 2)
# some text labels help clarify things:
text(9, 0.75, "y==1", col = "red")
text(6, 0.4, "y==3", col = "green")
text(5, 0.15, "y==2", col = "blue")

[plot: predicted probability curves for each category of y]

This plot gives us a bit more information than simply plotting predicted classes (as above). We now see that middling values of x are only somewhat more likely to be in category y==3 than in the other categories, whereas at extreme values of x, the data are much more likely to be in categories y==1 and y==2. A slightly more attractive variant of this uses the polygon plotting function rather than lines. Text labels help highlight which region corresponds to which category, and we can also optionally add a horizontal bar at the base of the plot to highlight the predicted class for each value of x:

plot(NA, xlim = c(min(x), max(x)), ylim = c(0, 1), xlab = "x", ylab = "Predicted Probability", 
    bty = "l")
# polygons
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 1], rep(0, nrow(p2))), col = rgb(1, 
    0, 0, 0.3), border = rgb(1, 0, 0, 0.3))
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 2], rep(0, nrow(p2))), col = rgb(0, 
    0, 1, 0.3), border = rgb(0, 0, 1, 0.3))
polygon(c(newdata$x, rev(newdata$x)), c(p2[, 3], rep(0, nrow(p2))), col = rgb(0, 
    1, 0, 0.3), border = rgb(0, 1, 0, 0.3))
# text labels
text(9, 0.4, "y=1", font = 2)
text(2.5, 0.4, "y=3", font = 2)
text(-1.5, 0.4, "y=2", font = 2)
# optionally highlight predicted class:
lines(newdata$x[p1 == 1], rep(0, sum(p1 == 1)), col = "red", lwd = 3)
lines(newdata$x[p1 == 2], rep(0, sum(p1 == 2)), col = "blue", lwd = 3)
lines(newdata$x[p1 == 3], rep(0, sum(p1 == 3)), col = "green", lwd = 3)

[plot: predicted probability polygons with the predicted class highlighted along the base]

This plot nicely highlights both the fitted class and the uncertainty associated with similar predicted probabilities at some values of x. Multinomial regression models can be difficult to interpret, but taking a few simple steps to estimate predicted probabilities and fitted classes and then plotting those estimates can make the models much more intuitive.

Multiple imputation

This tutorial covers techniques of multiple imputation. Multiple imputation is a strategy for dealing with missing data. Whereas we typically (i.e., automatically) deal with missing data through casewise deletion of any observations that have missing values on key variables, imputation attempts to replace missing values with an estimated value. In single imputation, we guess that missing value one time (perhaps based on the means of observed values, or a random sampling of those values). In multiple imputation, we instead draw multiple values for each missing value, effectively building multiple datasets, each of which replaces the missing data in a different way. There are numerous algorithms for this, each of which builds those multiple datasets in different ways. We're not going to discuss the details here, but instead focus on executing multiple imputation in R. The main challenge of multiple imputation is not the analysis (it simply proceeds as usual on each imputed dataset) but instead the aggregation of those separate analyses. The examples below discuss how to do this.

To get a basic feel for the process, let's imagine that we're trying to calculate the mean of a vector of values that contains missing values. We can impute the missing values by drawing from the observed values, repeat the process several times, and then average across the estimated means to get an estimate of the mean with a measure of uncertainty that accounts for the uncertainty due to imputation. Let's create a vector of ten values, seven of which we observe and three of which are missing, and imagine that they are random draws from the population whose mean we're trying to estimate:

set.seed(10)
x <- c(sample(1:10, 7, TRUE), rep(NA, 3))
x
##  [1]  6  4  5  7  1  3  3 NA NA NA

We can find the mean using case deletion:

mean(x, na.rm = TRUE)
## [1] 4.143

Our estimate of the sample standard error is then:

sd(x, na.rm = TRUE)/sqrt(sum(!is.na(x)))
## [1] 0.7693

Now let's impute several times to generate a list of imputed vectors:

imp <- replicate(15, c(x[!is.na(x)], sample(x[!is.na(x)], 3, TRUE)), simplify = FALSE)
imp
## [[1]]
##  [1] 6 4 5 7 1 3 3 4 1 7
## 
## [[2]]
##  [1] 6 4 5 7 1 3 3 1 7 6
## 
## [[3]]
##  [1] 6 4 5 7 1 3 3 1 5 7
## 
## [[4]]
##  [1] 6 4 5 7 1 3 3 6 4 5
## 
## [[5]]
##  [1] 6 4 5 7 1 3 3 3 3 1
## 
## [[6]]
##  [1] 6 4 5 7 1 3 3 3 5 5
## 
## [[7]]
##  [1] 6 4 5 7 1 3 3 1 3 4
## 
## [[8]]
##  [1] 6 4 5 7 1 3 3 3 5 7
## 
## [[9]]
##  [1] 6 4 5 7 1 3 3 6 4 3
## 
## [[10]]
##  [1] 6 4 5 7 1 3 3 5 3 3
## 
## [[11]]
##  [1] 6 4 5 7 1 3 3 3 1 7
## 
## [[12]]
##  [1] 6 4 5 7 1 3 3 4 4 6
## 
## [[13]]
##  [1] 6 4 5 7 1 3 3 3 4 4
## 
## [[14]]
##  [1] 6 4 5 7 1 3 3 6 7 6
## 
## [[15]]
##  [1] 6 4 5 7 1 3 3 3 5 3

The result is a list of fifteen vectors. The first seven values of each are the same as our original data, but the three missing values have been replaced with different combinations of the observed values. To get our new estimated mean, we simply take the mean of each vector and then average across them:

means <- sapply(imp, mean)
means
##  [1] 4.1 4.3 4.2 4.4 3.6 4.2 3.7 4.4 4.2 4.0 4.0 4.3 4.0 4.8 4.0
grandm <- mean(means)
grandm
## [1] 4.147

The result is 4.147, about the same as our original estimate. To get the standard error of our multiple imputation estimate, we need to combine the standard errors of each of our estimates, so that means we need to start by getting the SEs of each imputed vector:

ses <- sapply(imp, sd)/sqrt(10)

Aggregating the standard errors is a bit complicated, but basically sums the mean of the SEs (i.e., the “within-imputation variance”) with the variance across the different estimated means (the “between-imputation variance”). To calculate the within-imputation variance, we simply average the SE estimates:

within <- mean(ses)

To calculate the between-imputation variance, we calculate the sum of squared deviations of each imputed mean from the grand mean estimate:

between <- sum((means - grandm)^2)/(length(imp) - 1)

Then we sum the within- and between-imputation variances (multiply the latter by a small correction):

grandvar <- within + ((1 + (1/length(imp))) * between)
grandse <- sqrt(grandvar)
grandse
## [1] 0.8387

The resulting standard error is interesting: we increase the precision of our estimate by using 10 rather than 7 values (and standard errors shrink as the sample size grows), but the pooled standard error is larger than our original standard error because we have to account for the uncertainty due to imputation. Thus, if our missing values are truly missing at random, we can get a better estimate that is actually representative of our original population. Most multiple imputation algorithms are, however, applied to multivariate data rather than a single data vector and thereby use additional information about the relationship between observed values and missingness to reach even more precise estimates of target parameters.
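
The pooling steps above can be bundled into a small helper function. This is just a sketch that repeats the same arithmetic used above; the function name pool_means is our own:

pool_means <- function(means, ses) {
    grandm <- mean(means)  # pooled point estimate
    within <- mean(ses)  # within-imputation component, as above
    between <- sum((means - grandm)^2)/(length(means) - 1)  # between-imputation variance
    grandvar <- within + (1 + 1/length(means)) * between
    c(estimate = grandm, se = sqrt(grandvar))
}
pool_means(means, ses)  # reproduces the estimates calculated step by step above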

There are three main R packages that offer multiple imputation techniques. Several other packages - described in the OfficialStatistics Task View - supply other imputation techniques, but packages Amelia (by Gary King and collaborators), mi (by Andrew Gelman and collaborators), and mice (by Stef van Buuren and collaborators) provide more than enough to work with. Let's start by installing these packages:

install.packages(c("Amelia", "mi", "mice"), repos = "http://cran.r-project.org")
## Warning: packages 'Amelia', 'mi', 'mice' are in use and will not be
## installed

Now, let's consider an imputation situation where we plan to conduct a regression analysis predicting y with two covariates, x1 and x2, but we have missing data in x1 and x2. Let's start by creating the dataframe:

x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
y <- x1 + x2 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, y)

Now, let's randomly remove some of the observed values of the independent variables:

mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA

The result is the removal of thirty values, 20 from x1 and 10 from x2:

summary(mydf)
##        x1              x2               y        
##  Min.   :0.098   Min.   :-2.321   Min.   :-1.35  
##  1st Qu.:1.138   1st Qu.:-0.866   1st Qu.: 1.17  
##  Median :2.341   Median : 0.095   Median : 2.39  
##  Mean   :2.399   Mean   :-0.038   Mean   : 2.28  
##  3rd Qu.:3.626   3rd Qu.: 0.724   3rd Qu.: 3.69  
##  Max.   :4.919   Max.   : 2.221   Max.   : 6.26  
##  NA's   :20      NA's   :10

If we estimate the regression on these data, R will force casewise deletion of 28 cases:

lm <- lm(y ~ x1 + x2, data = mydf)
summary(lm)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = mydf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5930 -0.7222  0.0018  0.7140  2.4878 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0259     0.2196    0.12     0.91    
## x1            0.9483     0.0824   11.51  < 2e-16 ***
## x2            0.7487     0.1203    6.23  3.3e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.969 on 69 degrees of freedom
##   (28 observations deleted due to missingness)
## Multiple R-squared:  0.706,  Adjusted R-squared:  0.698 
## F-statistic: 82.9 on 2 and 69 DF,  p-value: <2e-16

We should thus be quite skeptical of our results given that we're discarding a substantial portion of our observations (28%, in fact). Let's see how the various multiple imputation packages address this and affect our inference.

Amelia

library(Amelia)
imp.amelia <- amelia(mydf)
## -- Imputation 1 --
## 
##   1  2  3  4  5  6  7  8  9
## 
## -- Imputation 2 --
## 
##   1  2  3  4  5  6  7
## 
## -- Imputation 3 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 4 --
## 
##   1  2  3  4  5  6
## 
## -- Imputation 5 --
## 
##   1  2  3  4  5  6  7

Once we've run our multiple imputation, we can see where our missing data lie:

missmap(imp.amelia)

[plot: Amelia missingness map produced by missmap]

We can also run our regression model on each imputed dataset. We'll use the lapply function to do this quickly on each of the imputed dataframes:

lm.amelia.out <- lapply(imp.amelia$imputations, function(i) lm(y ~ x1 + x2, 
    data = i))

If we look at lm.amelia.out we'll see the results of the model run on each imputed dataframe separately:

lm.amelia.out
## $imp1
## 
## Call:
## lm(formula = y ~ x1 + x2, data = i)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       0.247        0.854        0.707  
## 
## 
## $imp2
## 
## Call:
## lm(formula = y ~ x1 + x2, data = i)
## 
## Coefficients:
## (Intercept)           x1           x2  
##       0.164        0.931        0.723  
## 
## 
## $imp3
## 
## Call:
## lm(formula = y ~ x1 + x2, data = i)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      0.0708       0.9480       0.8234  
## 
## 
## $imp4
## 
## Call:
## lm(formula = y ~ x1 + x2, data = i)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      0.0656       0.9402       0.6446  
## 
## 
## $imp5
## 
## Call:
## lm(formula = y ~ x1 + x2, data = i)
## 
## Coefficients:
## (Intercept)           x1           x2  
##      -0.064        0.956        0.820

Aggregating across the results is a little bit tricky because we have to extract the coefficients and standard errors from each model, format them in a particular way, and then feed that structure into the mi.meld function:

coefs.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[, 
    1]))
ses.amelia <- do.call(rbind, lapply(lm.amelia.out, function(i) coef(summary(i))[, 
    2]))
mi.meld(coefs.amelia, ses.amelia)
## $q.mi
##      (Intercept)    x1     x2
## [1,]     0.09683 0.926 0.7436
## 
## $se.mi
##      (Intercept)      x1     x2
## [1,]      0.2291 0.08222 0.1267

Now let's compare these results to those of our original model:

t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))
##                [,1]    [,2]
## (Intercept) 0.09683 0.22908
## x1          0.92598 0.08222
## x2          0.74359 0.12674
coef(summary(lm))[, 1:2]  # original results
##             Estimate Std. Error
## (Intercept)  0.02587    0.21957
## x1           0.94835    0.08243
## x2           0.74874    0.12026

mi

library(mi)

Let's start by visualizing the missing data:

mp.plot(mydf)

[plot: missingness pattern plot produced by mp.plot]

We can then see some summary information about the dataset and the nature of the missingness:

mi.info(mydf)
##   names include order number.mis all.mis                type collinear
## 1    x1     Yes     1         20      No positive-continuous        No
## 2    x2     Yes     2         10      No          continuous        No
## 3     y     Yes    NA          0      No          continuous        No

With that information confirmed, it is incredibly easy to conduct our multiple imputation using the mi function:

imp.mi <- mi(mydf)
## Beginning Multiple Imputation ( Wed Nov 13 22:07:34 2013 ):
## Iteration 1 
##  Chain 1 : x1*  x2*  
##  Chain 2 : x1*  x2*  
##  Chain 3 : x1*  x2*  
## Iteration 2 
##  Chain 1 : x1*  x2   
##  Chain 2 : x1*  x2*  
##  Chain 3 : x1*  x2*  
## Iteration 3 
##  Chain 1 : x1*  x2*  
##  Chain 2 : x1*  x2   
##  Chain 3 : x1*  x2   
## Iteration 4 
##  Chain 1 : x1   x2   
##  Chain 2 : x1*  x2   
##  Chain 3 : x1   x2   
## Iteration 5 
##  Chain 1 : x1*  x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2*  
## Iteration 6 
##  Chain 1 : x1*  x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2*  
## Iteration 7 
##  Chain 1 : x1   x2*  
##  Chain 2 : x1*  x2   
##  Chain 3 : x1   x2   
## Iteration 8 
##  Chain 1 : x1   x2   
##  Chain 2 : x1*  x2   
##  Chain 3 : x1*  x2   
## Iteration 9 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1*  x2   
## Iteration 10 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 11 
##  Chain 1 : x1*  x2*  
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 12 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 13 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 14 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## mi converged ( Wed Nov 13 22:07:37 2013 )
## Run 20 more iterations to mitigate the influence of the noise...
## Beginning Multiple Imputation ( Wed Nov 13 22:07:37 2013 ):
## Iteration 1 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 2 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 3 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 4 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 5 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 6 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 7 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 8 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 9 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 10 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 11 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 12 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 13 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 14 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 15 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 16 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 17 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 18 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 19 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## Iteration 20 
##  Chain 1 : x1   x2   
##  Chain 2 : x1   x2   
##  Chain 3 : x1   x2   
## mi converged ( Wed Nov 13 22:07:41 2013 )
imp.mi
## 
## Multiply imputed data set
## 
## Call:
##  .local(object = object, n.iter = ..3, R.hat = ..4, max.minutes = ..2, 
##     run.past.convergence = TRUE)
## 
## Number of multiple imputations:  3 
## 
## Number and proportion of missing data per column:
##   names                type number.mis proportion
## 1    x1 positive-continuous         20        0.2
## 2    x2          continuous         10        0.1
## 3     y          continuous          0        0.0
## 
## Total Cases: 100
## Missing at least one item: 2
## Complete cases: 72

The output above reports how many imputed datasets were produced and summarizes the missingness information we saw earlier. For linear regression (and several other common models), the mi package includes functions that automatically run the model on each imputed dataset and aggregate the results:

lm.mi.out <- lm.mi(y ~ x1 + x2, imp.mi)

We can extract the results using the following:

coef.mi <- lm.mi.out@mi.pooled  # extract the pooled coefficients and standard errors
# or see them quickly with:
display(lm.mi.out)
## =======================================
## Separate Estimates for each Imputation
## =======================================
## 
## ** Chain 1 **
## lm(formula = formula, data = mi.data[[i]])
##             coef.est coef.se
## (Intercept) -0.01     0.18  
## x1           0.96     0.06  
## x2           0.75     0.09  
## ---
## n = 100, k = 3
## residual sd = 0.89, R-Squared = 0.75
## 
## ** Chain 2 **
## lm(formula = formula, data = mi.data[[i]])
##             coef.est coef.se
## (Intercept) 0.03     0.18   
## x1          0.94     0.07   
## x2          0.67     0.09   
## ---
## n = 100, k = 3
## residual sd = 0.91, R-Squared = 0.74
## 
## ** Chain 3 **
## lm(formula = formula, data = mi.data[[i]])
##             coef.est coef.se
## (Intercept) -0.02     0.20  
## x1           0.96     0.07  
## x2           0.69     0.10  
## ---
## n = 100, k = 3
## residual sd = 0.96, R-Squared = 0.71
## 
## =======================================
## Pooled Estimates
## =======================================
## lm.mi(formula = y ~ x1 + x2, mi.object = imp.mi)
##             coef.est coef.se
## (Intercept) 0.00     0.19   
## x1          0.95     0.07   
## x2          0.71     0.10   
## ---

Let's compare these results to our original model:

do.call(cbind, coef.mi)  # multiply imputed results
##             coefficients      se
## (Intercept)      0.00123 0.18878
## x1               0.95311 0.06901
## x2               0.70687 0.10411
coef(summary(lm))[, 1:2]  # original results
##             Estimate Std. Error
## (Intercept)  0.02587    0.21957
## x1           0.94835    0.08243
## x2           0.74874    0.12026

mice

library(mice)

To conduct the multiple imputation, we simply need to run the mice function:

imp.mice <- mice(mydf)
## 
##  iter imp variable
##   1   1  x1  x2
##   1   2  x1  x2
##   1   3  x1  x2
##   1   4  x1  x2
##   1   5  x1  x2
##   2   1  x1  x2
##   2   2  x1  x2
##   2   3  x1  x2
##   2   4  x1  x2
##   2   5  x1  x2
##   3   1  x1  x2
##   3   2  x1  x2
##   3   3  x1  x2
##   3   4  x1  x2
##   3   5  x1  x2
##   4   1  x1  x2
##   4   2  x1  x2
##   4   3  x1  x2
##   4   4  x1  x2
##   4   5  x1  x2
##   5   1  x1  x2
##   5   2  x1  x2
##   5   3  x1  x2
##   5   4  x1  x2
##   5   5  x1  x2

We can see some summary information about the imputation process:

summary(imp.mice)
## Multiply imputed data set
## Call:
## mice(data = mydf)
## Number of multiple imputations:  5
## Missing cells per column:
## x1 x2  y 
## 20 10  0 
## Imputation methods:
##    x1    x2     y 
## "pmm" "pmm"    "" 
## VisitSequence:
## x1 x2 
##  1  2 
## PredictorMatrix:
##    x1 x2 y
## x1  0  1 1
## x2  1  0 1
## y   0  0 0
## Random generator seed value:  NA

To run our regression we use the lm function wrapped in a with call, which estimates our model on each imputed dataframe:

lm.mice.out <- with(imp.mice, lm(y ~ x1 + x2))
summary(lm.mice.out)
## 
##  ## summary of imputation 1 :
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7893 -0.7488  0.0955  0.7205  2.3768 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0636     0.1815    0.35     0.73    
## x1            0.9528     0.0660   14.44  < 2e-16 ***
## x2            0.6548     0.0916    7.15  1.6e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.915 on 97 degrees of freedom
## Multiple R-squared:  0.735,  Adjusted R-squared:  0.73 
## F-statistic:  135 on 2 and 97 DF,  p-value: <2e-16
## 
## 
##  ## summary of imputation 2 :
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4456 -0.6911  0.0203  0.6839  2.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1618     0.1855    0.87     0.39    
## x1            0.8849     0.0671   13.19  < 2e-16 ***
## x2            0.7424     0.0942    7.88  4.7e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.939 on 97 degrees of freedom
## Multiple R-squared:  0.721,  Adjusted R-squared:  0.716 
## F-statistic:  126 on 2 and 97 DF,  p-value: <2e-16
## 
## 
##  ## summary of imputation 3 :
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5123 -0.6683  0.0049  0.6717  2.5072 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0800     0.1872    0.43     0.67    
## x1            0.9402     0.0674   13.95  < 2e-16 ***
## x2            0.8150     0.0947    8.61  1.3e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.94 on 97 degrees of freedom
## Multiple R-squared:  0.721,  Adjusted R-squared:  0.715 
## F-statistic:  125 on 2 and 97 DF,  p-value: <2e-16
## 
## 
##  ## summary of imputation 4 :
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5739 -0.7310  0.0152  0.6534  2.4748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0787     0.1815    0.43     0.67    
## x1            0.9438     0.0660   14.30  < 2e-16 ***
## x2            0.7833     0.0916    8.55  1.8e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.922 on 97 degrees of freedom
## Multiple R-squared:  0.732,  Adjusted R-squared:  0.726 
## F-statistic:  132 on 2 and 97 DF,  p-value: <2e-16
## 
## 
##  ## summary of imputation 5 :
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6130 -0.6085  0.0085  0.6907  2.4719 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.0477     0.1858    0.26      0.8    
## x1            0.9463     0.0672   14.07  < 2e-16 ***
## x2            0.7436     0.0893    8.33  5.4e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.91 on 97 degrees of freedom
## Multiple R-squared:  0.738,  Adjusted R-squared:  0.733 
## F-statistic:  137 on 2 and 97 DF,  p-value: <2e-16

The results above are reported separately for each imputed dataset. To pool them, we use the pool function:

pool.mice <- pool(lm.mice.out)

Let's compare these results to our original model:

summary(pool.mice)  # multiply imputed results
##                 est      se       t    df  Pr(>|t|)   lo 95  hi 95 nmis
## (Intercept) 0.08637 0.19056  0.4533 81.41 6.516e-01 -0.2927 0.4655   NA
## x1          0.93361 0.07327 12.7422 50.19 0.000e+00  0.7865 1.0808   20
## x2          0.74782 0.11340  6.5944 22.53 1.104e-06  0.5130 0.9827   10
##                 fmi  lambda
## (Intercept) 0.08664 0.06447
## x1          0.20145 0.17025
## x2          0.38954 0.33765
coef(summary(lm))[, 1:2]  # original results
##             Estimate Std. Error
## (Intercept)  0.02587    0.21957
## x1           0.94835    0.08243
## x2           0.74874    0.12026

Comparing packages

It is useful at this point to compare the coefficients from each of our multiple imputation methods. To do so, we'll pull out the coefficients from each of the three packages' results, our original observed results (with case deletion), and the results for the real data-generating process (before we introduced missingness).

Amelia package results

s.amelia <- t(do.call(rbind, mi.meld(coefs.amelia, ses.amelia)))

mi package results

s.mi <- do.call(cbind, coef.mi)  # multiply imputed results

mice package results

s.mice <- summary(pool.mice)[, 1:2]  # multiply imputed results

Original results (case deletion)

s.orig <- coef(summary(lm))[, 1:2]  # original results

Real results (before missingness was introduced)

s.real <- summary(lm(y ~ x1 + x2))$coef[, 1:2]

Let's print the coefficients together to compare them:

allout <- cbind(s.real[, 1], s.amelia[, 1], s.mi[, 1], s.mice[, 1], s.orig[, 
    1])
colnames(allout) <- c("Real Relationship", "Amelia", "MI", "mice", "Original")
allout
##             Real Relationship  Amelia      MI    mice Original
## (Intercept)           0.04502 0.09683 0.00123 0.08637  0.02587
## x1                    0.95317 0.92598 0.95311 0.93361  0.94835
## x2                    0.82900 0.74359 0.70687 0.74782  0.74874

All three of the multiple imputation models - despite vast differences in the underlying approaches to imputation in the three packages - yield strikingly similar inference. This was a relatively basic example, and all of the packages offer a number of options for more complicated situations than the one examined here. While executing multiple imputation requires choosing a package and typing some potentially tedious code, the results are almost always going to be better than the easier alternative of deleting incomplete cases and ignoring the consequences.
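
We could also line up the standard errors from each approach in the same way (each object created above stores them in its second column), since similar inference depends on the uncertainty estimates as well as the point estimates. A minimal sketch, reusing the objects already in memory:

allses <- cbind(s.real[, 2], s.amelia[, 2], s.mi[, 2], s.mice[, 2], s.orig[, 2])
colnames(allses) <- c("Real Relationship", "Amelia", "MI", "mice", "Original")
allses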

Multivariate Regression

The bivariate OLS tutorial covers most of the details of model building and output, so this tutorial is comparatively short. It addresses some additional details about multivariate OLS models.

We'll begin by generating some fake data involving a few covariates. We'll then generate two outcomes, one that is a simple linear function of the covariates and one that involves an interaction.

set.seed(50)
n <- 200
x1 <- rbinom(n, 1, 0.5)
x2 <- rnorm(n)
x3 <- rnorm(n, 0, 4)
y1 <- x1 + x2 + x3 + rnorm(n)
y2 <- x1 + x2 + x3 + 2 * x1 * x2 + rnorm(n)

Now we can see how to model each of these processes.

Regression formulae for multiple covariates

As covered in the formulae tutorial, we can easily represent a multivariate model using a formula just like we did for a bivariate model. For example, a bivariate model might look like:

y1 ~ x1
## y1 ~ x1
## <environment: 0x000000001c3d67b0>

And a multivariate model would look like:

y1 ~ x1 + x2 + x3
## y1 ~ x1 + x2 + x3
## <environment: 0x000000001c3d67b0>

To include the interaction we can use the * operator, which includes the constituent terms along with their interaction (the : operator, by contrast, specifies only the interaction term itself). Writing the constituent terms out explicitly produces the same model:

y1 ~ x1 * x2 + x3
## y1 ~ x1 * x2 + x3
## <environment: 0x000000001c3d67b0>
y1 ~ x1 + x2 + x1 * x2 + x3
## y1 ~ x1 + x2 + x1 * x2 + x3
## <environment: 0x000000001c3d67b0>

The order of variables in a regression formula doesn't matter. Generally, R will print out the regression results in the order that the variables are listed in the formula, but there are exceptions. For example, interactions are listed after the main effects, as we'll see below.
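
As a quick check, we can estimate the same model with the covariates listed in two different orders and confirm that identical coefficients come back, just printed in a different order:

coef(lm(y1 ~ x1 + x2 + x3))
coef(lm(y1 ~ x3 + x2 + x1))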

Regression estimation

Estimating a multivariate model is just like a bivariate model:

lm(y1 ~ x1 + x2 + x3)
## 
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
## 
## Coefficients:
## (Intercept)           x1           x2           x3  
##       0.111        0.924        1.012        0.993

We can do the same with an interaction model:

lm(y1 ~ x1 * x2 + x3)
## 
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
## 
## Coefficients:
## (Intercept)           x1           x2           x3        x1:x2  
##       0.116        0.917        0.892        0.992        0.228

By default, the estimated coefficients print to the console. If we want to do anything else with the model, we need to store it as an object and then we can perform further procedures:

m1 <- lm(y1 ~ x1 + x2 + x3)
m2 <- lm(y1 ~ x1 * x2 + x3)

Extracting coefficients

To obtain just the coefficients themselves, we can use the coef function applied to the model object:

coef(m1)
## (Intercept)          x1          x2          x3 
##      0.1106      0.9236      1.0119      0.9932

Similarly, we can use residuals to see the model residuals. We'll just list the first 15 here:

residuals(m1)[1:15]
##        1        2        3        4        5        6        7        8 
##  0.11464 -0.51139 -0.53711 -0.39099 -0.05157  0.46566 -0.11753 -1.25015 
##        9       10       11       12       13       14       15 
## -1.03919 -0.32588 -0.97016  0.89039  0.30830 -1.58698  0.83643

The model objects also include all of the data used to estimate the model in a sub-object called model. Let's look at its first few rows:

head(m1$model)
##        y1 x1        x2      x3
## 1   4.012  1 -1.524406   4.436
## 2 -10.837  0  0.043114 -10.551
## 3  -2.991  0  0.084210  -2.668
## 4  -7.166  1  0.354005  -8.223
## 5   4.687  1  1.104567   2.604
## 6   4.333  0  0.004345   3.778

There are lots of other things stored in a model object that don't concern us right now, but that you could see with str(m1) or str(m2).

Regression summaries

Much more information about a model becomes available when we use the summary function:

summary(m1)
## 
## Call:
## lm(formula = y1 ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5808 -0.6384 -0.0632  0.5966  2.8379 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1106     0.0976    1.13     0.26    
## x1            0.9236     0.1431    6.45  8.4e-10 ***
## x2            1.0119     0.0716   14.13  < 2e-16 ***
## x3            0.9932     0.0168   59.01  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.01 on 196 degrees of freedom
## Multiple R-squared:  0.951,  Adjusted R-squared:  0.951 
## F-statistic: 1.28e+03 on 3 and 196 DF,  p-value: <2e-16

Indeed, as with a bivariate model, a complete representation of the regression results is printed to the console, including coefficients, standard errors, t-statistics, p-values, some summary statistics about the regression residuals, and various model fit statistics. The summary object itself can be saved and objects extracted from it:

s1 <- summary(m1)

A look at the structure of s1 shows that there is considerable detail stored in the summary object:

str(s1)
## List of 11
##  $ call         : language lm(formula = y1 ~ x1 + x2 + x3)
##  $ terms        :Classes 'terms', 'formula' length 3 y1 ~ x1 + x2 + x3
##   .. ..- attr(*, "variables")= language list(y1, x1, x2, x3)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:4] "y1" "x1" "x2" "x3"
##   .. .. .. ..$ : chr [1:3] "x1" "x2" "x3"
##   .. ..- attr(*, "term.labels")= chr [1:3] "x1" "x2" "x3"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: 0x000000001c3d67b0> 
##   .. ..- attr(*, "predvars")= language list(y1, x1, x2, x3)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:4] "y1" "x1" "x2" "x3"
##  $ residuals    : Named num [1:200] 0.1146 -0.5114 -0.5371 -0.391 -0.0516 ...
##   ..- attr(*, "names")= chr [1:200] "1" "2" "3" "4" ...
##  $ coefficients : num [1:4, 1:4] 0.1106 0.9236 1.0119 0.9932 0.0976 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##   .. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
##  $ aliased      : Named logi [1:4] FALSE FALSE FALSE FALSE
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "x1" "x2" "x3"
##  $ sigma        : num 1.01
##  $ df           : int [1:3] 4 196 4
##  $ r.squared    : num 0.951
##  $ adj.r.squared: num 0.951
##  $ fstatistic   : Named num [1:3] 1278 3 196
##   ..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
##  $ cov.unscaled : num [1:4, 1:4] 0.009406 -0.009434 -0.000255 0.000116 -0.009434 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##   .. ..$ : chr [1:4] "(Intercept)" "x1" "x2" "x3"
##  - attr(*, "class")= chr "summary.lm"

This includes all of the details that were printed to the console, which we extract separately, such as the coefficients:

coef(s1)
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1106    0.09757   1.133  2.584e-01
## x1            0.9236    0.14311   6.454  8.374e-10
## x2            1.0119    0.07159  14.135  9.787e-32
## x3            0.9932    0.01683  59.008 9.537e-127
s1$coefficients
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1106    0.09757   1.133  2.584e-01
## x1            0.9236    0.14311   6.454  8.374e-10
## x2            1.0119    0.07159  14.135  9.787e-32
## x3            0.9932    0.01683  59.008 9.537e-127

Model fit statistics:

s1$sigma
## [1] 1.006
s1$r.squared
## [1] 0.9514
s1$adj.r.squared
## [1] 0.9506
s1$fstatistic
## value numdf dendf 
##  1278     3   196

And so forth. These details become useful to be able to extract when we want to output our results to another format, such as Word, LaTeX, or something else.
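
For example, a minimal sketch of pulling the extracted estimates and standard errors into a small table (here just a data.frame) that could then be handed off to an export tool:

# assemble estimates and standard errors from the summary object
out <- data.frame(estimate = coef(s1)[, 1], se = coef(s1)[, 2])
round(out, 2)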

Interaction Model Output

The output from a model that includes interactions is essentially the same as for a model without any interactions, but note that the interaction coefficients are printed at the end of the output:

coef(m2)
## (Intercept)          x1          x2          x3       x1:x2 
##      0.1161      0.9172      0.8923      0.9925      0.2278
s2 <- summary(m2)
s2
## 
## Call:
## lm(formula = y1 ~ x1 * x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6387 -0.6031 -0.0666  0.6153  2.8076 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.1161     0.0973    1.19     0.23    
## x1            0.9172     0.1426    6.43  9.5e-10 ***
## x2            0.8923     0.1035    8.62  2.2e-15 ***
## x3            0.9925     0.0168   59.17  < 2e-16 ***
## x1:x2         0.2278     0.1428    1.59     0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1 on 195 degrees of freedom
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.951 
## F-statistic:  967 on 4 and 195 DF,  p-value: <2e-16
coef(s2)
##             Estimate Std. Error t value   Pr(>|t|)
## (Intercept)   0.1161    0.09725   1.194  2.340e-01
## x1            0.9172    0.14261   6.431  9.536e-10
## x2            0.8923    0.10346   8.625  2.225e-15
## x3            0.9925    0.01677  59.171 1.549e-126
## x1:x2         0.2278    0.14282   1.595  1.123e-01

Plots of Multivariate OLS Models

As with bivariate models, we can easily plot our observed data pairwise:

plot(y1 ~ x2)

plot of chunk unnamed-chunk-17

And we can overlay predicted values of the outcome on that kind of plot:

plot(y1 ~ x2)
abline(m1$coef[1], m1$coef[3])

plot of chunk unnamed-chunk-18

or plot the model residuals against included variables:

layout(matrix(1:2, nrow = 1))
plot(m1$residuals ~ x1)
plot(m1$residuals ~ x2)

plot of chunk unnamed-chunk-19

For more details on plotting regressions, see the section of tutorials on Regression Plotting.

Numeric Printing

While most of R's default print settings are reasonable, R also provides fine-grained control over the display of output. This control is helpful not only for looking at data and results, but also for correctly interpreting them and then outputting them to other formats (e.g., for use in a publication).

False Precision

One of the biggest errors made by users of statistical software is the abuse of “false precision.” The idea of “false precision” is that the analyst uses the output of a statistical algorithm directly rather than formatting that output in line with the precision of the actual data. Statistical algorithms, when executed by computers, will typically produce output to a finite but very large number of decimal places even though the underlying data only allow precision to a smaller number of decimals. Take, for example, the task of calculating the average height of a group of individuals. Perhaps we have a tool capable of measuring height to the nearest centimeter. Let's say this is our data for six individuals:

height <- c(167, 164, 172, 158, 181, 179)

We can then use R to calculate the mean height of this group:

mean(height)
## [1] 170.2

Internally, R computes this mean to full double precision (170.1666...), even though only a few digits are printed. But because our data are only precise to whole centimeters, the concept of “significant figures” applies. According to those rules, we can only report a result that is precise to the number of significant digits in our original data plus one. Our original data have three significant digits, so the result can have only one decimal place: the mean is thus 170.2, not 170.167. This matters because we might be tempted to compare our mean to another mean (as part of some analysis), yet we can only detect differences at the tenths place and no further. A different group with a calculated mean height of 170.191 (measured with the same tool) would therefore have a mean indistinguishable from that in our group. These kinds of calculations must often be done by hand, but R can do them for us using several different functions.

signif and round

The most direct way to properly round our results is with either signif or round. signif rounds to a specified number of significant digits. For our above example with four significant figures, we can use:

signif(mean(height), 4)
## [1] 170.2

An alternative approach is to use round to specify a number of decimal places. For the above example, this would be 1:

round(mean(height), 1)
## [1] 170.2

round also accepts negative values, which round to the tens, hundreds, etc. places:

round(mean(height), -1)
## [1] 170

Figuring out significant figures can sometimes be difficult, particularly when the precision of the original data is ambiguous. A good rule of thumb for social science data is two significant digits unless those data are known to have greater precision. As an example, surveys often measure constructs on an arbitrary scale (e.g., 1-7). There is one digit of precision in these data, so any results from them should have only two significant figures.
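
For instance, a quick illustration with some hypothetical responses on a 1-7 scale:

likert <- c(4, 5, 3, 6, 4, 2, 7, 5)  # made-up survey responses
signif(mean(likert), 2)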

digits options

While R typically prints to a large number of digits (the default on my machine is 7), the above reminds us that we shouldn't rely on R's defaults because they can convey false precision. Rather than having to round everything that comes out of R each time, we can also set the number of significant digits to print globally. We might, for example, follow a rule of thumb of two significant digits for our results:

mean(height)
## [1] 170.2
sd(height)
## [1] 8.886
options(digits = 2)
mean(height)
## [1] 170
sd(height)
## [1] 8.9

But we can easily change this again to whatever value we choose. Note: computers are limited in the precision they can actually store, so requesting a large number of digits may produce unexpected results.

options(digits = 20)
mean(height)
## [1] 170.16666666666665719
sd(height)
## [1] 8.8863190729720411554
options(digits = 7)

Another useful global option is scipen, which decides whether R reports results in scientific notation. If we specify a negative value for scipen, R will tend to report results in scientific notation. And, if we specify a positive value for scipen, R will tend to report results in fixed notation, even when they are very small or very large. Its default value is 0 (meaning no tendency either way).

options(scipen = -10)
1e+10
## [1] 1e+10
1e-10
## [1] 1e-10
options(scipen = 10)
1e+10
## [1] 10000000000
1e-10
## [1] 0.0000000001
options(scipen = 0)
1e+10
## [1] 1e+10
1e-10
## [1] 1e-10

sprintf

Another strategy for formatting output is the sprintf function. sprintf is very flexible, so I won't explain all the details here, but it can be used to format a number (and other things) into any variety of formats as a character string. Here are some examples from ?sprintf for the display of pi:

sprintf("%f", pi)
## [1] "3.141593"
sprintf("%.3f", pi)
## [1] "3.142"
sprintf("%1.0f", pi)
## [1] "3"
sprintf("%5.1f", pi)
## [1] "  3.1"
sprintf("%05.1f", pi)
## [1] "003.1"
sprintf("%+f", pi)
## [1] "+3.141593"
sprintf("% f", pi)
## [1] " 3.141593"

OLS as Regression on Means

One of the things that I find most difficult about regression is visualizing what is actually happening when a regression model is fit. One way to better understand that process is to recognize that a regression is simply a curve through the conditional mean values of an outcome at each value of one or more predictors. Thus, we can actually estimate (i.e., “figure out”) the regression line simply by determining the conditional mean of our outcome at each value of our input. This is easiest to see in a bivariate regression, so let's create some data and build the model:

set.seed(100)
x <- sample(0:50, 10000, TRUE)
# xsq <- x^2  # a squared term, just for fun
# the data-generating process:
y <- 2 + x + (x^2) + rnorm(10000, 0, 300)

Now let's calculate the conditional means of x, x^2, and y at each value of x:

condmeans_x <- by(x, x, mean)
condmeans_x2 <- by(x^2, x, mean)
condmeans_y <- by(y, x, mean)

If we run the regression on the original data (assuming we know the data-generating process), we'll get the following:

lm1 <- lm(y ~ x + I(x^2))
summary(lm1)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1229.5  -200.0    -0.7   200.6  1196.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.0643     8.7796    0.92     0.36    
## x            -0.2293     0.8087   -0.28     0.78    
## I(x^2)        1.0259     0.0156   65.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 300 on 9997 degrees of freedom
## Multiple R-squared:  0.871,  Adjusted R-squared:  0.871 
## F-statistic: 3.37e+04 on 2 and 9997 DF,  p-value: <2e-16

If we run the regression instead on just the conditional means (i.e., one value of y at each value of x), we will get the following:

lm2 <- lm(condmeans_y ~ condmeans_x + condmeans_x2)
summary(lm2)
## 
## Call:
## lm(formula = condmeans_y ~ condmeans_x + condmeans_x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.57 -17.20   0.85  17.21  50.86 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.2157     9.7430    0.74     0.46    
## condmeans_x   -0.1806     0.9012   -0.20     0.84    
## condmeans_x2   1.0250     0.0174   58.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.1 on 48 degrees of freedom
## Multiple R-squared:  0.999,  Adjusted R-squared:  0.999 
## F-statistic: 2.65e+04 on 2 and 48 DF,  p-value: <2e-16

The results from the two models look very similar. Aside from some minor variations, they provide identical substantive inference about the process at hand. We can see this if we plot the original data (in gray) and overlay it with the conditional means of y (in red):

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)

plot of chunk unnamed-chunk-5

We can add predicted output lines (one for each model) to this plot to see the similarity of the two models even more clearly. Indeed, the lines overlay each other perfectly:

plot(y ~ x, col = rgb(0, 0, 0, 0.05), pch = 16)
points(condmeans_y ~ condmeans_x, col = "red", pch = 15)
lines(0:50, predict(lm1, data.frame(x = 0:50)), col = "green", type = "l", lwd = 2)
lines(0:50, predict(lm2, data.frame(x = 0:50)), col = "blue", type = "l", lwd = 2)

plot of chunk unnamed-chunk-6

So, if you ever struggle to think about what regression is doing, just remember that it is simply drawing a (potentially multidimensional) curve through the conditional means of the outcome at every value of the covariate(s).

OLS Goodness of Fit

When building regression models, one of the biggest questions relates to “goodness-of-fit.” How well does our model of the data (i.e., the selected predictor variables) actually “fit” the outcome data? In other words, how much of the variation in an outcome can we explain with a particular model? R provides a number of useful ways of assessing model fit, some of which are common (but not necessarily good) and some of which are uncommon (but probably much better). To see these statistics in action, we'll build some fake data and then fit both a small bivariate model that incompletely captures the data-generating process and a larger multivariate model that captures it much more completely:

set.seed(100)
x1 <- runif(500, 0, 10)
x2 <- rnorm(500, -2, 1)
x3 <- rbinom(500, 1, 0.5)
y <- -1 + x1 + x2 + 3 * x3 + rnorm(500)

Let's generate the bivariate model m1 and store it and its output sm1 for later:

m1 <- lm(y ~ x1)
sm1 <- summary(m1)

Then let's do the same for the multivariate model m2 and its output sm2:

m2 <- lm(y ~ x1 + x2 + x3)
sm2 <- summary(m2)

Below we'll look at some different ways of assessing model fit.

R-Squared

One measure commonly used - perhaps against better wisdom - for assessing model fit is R-squared. R-squared speaks to the proportion of variance in the outcome that can be accounted for by the model. Looking at our simple bivariate model m1, we can extract R-squared as a measure of model fit in a number of ways. The easiest is simply to extract it from the sm1 summary object:

sm1$r.squared
## [1] 0.6223

But we can also calculate R-squared from our data in a number of ways:

cor(y, x1)^2  # manually, as squared bivariate correlation
## [1] 0.6223
var(m1$fitted)/var(y)  # manually, as ratio of variances
## [1] 0.6223
(coef(m1)[2]/sqrt(cov(y, y)/cov(x1, x1)))^2  # manually, as weighted regression coefficient
##     x1 
## 0.6223

Commonly, we actually use the “Adjusted R-squared” because “regular” R-squared is sensitive to the number of independent variables in the model (i.e., as we put more variables into the model, R-squared increases even if those variables are unrelated to the outcome). Adjusted R-squared attempts to correct for this by deflating R-squared by the expected amount of increase from including irrelevant additional predictors. We can see this property of R-squared and adjusted R-squared by adding several completely random variables, unrelated to our other covariates or to the outcome, into our model and examining the impact on both statistics.

tmp1 <- rnorm(500, 0, 10)
tmp2 <- rnorm(500, 0, 10)
tmp3 <- rnorm(500, 0, 10)
tmp4 <- rnorm(500, 0, 10)

We can then compare the R-squared from our original bivariate model to that from the garbage dump model:

sm1$r.squared
## [1] 0.6223
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$r.squared
## [1] 0.6289

R-squared increased somewhat, even though these variables are unrelated to y. The adjusted R-squared value also changes, but less so than R-squared:

sm1$adj.r.squared
## [1] 0.6216
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$adj.r.squared
## [1] 0.6259
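
For reference, the adjustment deflates R-squared using the sample size and the number of predictors. A minimal hand calculation with the standard formula reproduces sm1$adj.r.squared:

n <- length(y)  # 500 observations
k <- 1          # one predictor in m1
1 - (1 - sm1$r.squared) * (n - 1)/(n - k - 1)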

So adjusted R-squared is still imperfect and, more than anything, this highlights the problems of relying on R-squared (in either form) as a reliable measure of model fit. Of course, when we compare the R-squared and adjusted R-squared values from our bivariate model m1 to those from our fuller multivariate model m2, we see appropriate increases:

# R-squared
sm1$r.squared
## [1] 0.6223
sm2$r.squared
## [1] 0.9166
# adjusted R-squared
sm1$adj.r.squared
## [1] 0.6216
sm2$adj.r.squared
## [1] 0.9161

In both cases, we see that R-squared and adjusted R-squared increase. The challenge is that because R-squared depends on factors other than model fit, it is an imperfect metric.

Standard Error of the Regression

A very nice way to assess model fit is the standard error of the regression (SER), sometimes just called sigma. In R regression output, the value is labeled “Residual standard error” and is stored in a model summary object as sigma. You'll see it near the bottom of the model output for the bivariate model m1:

sm1
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.050 -1.634 -0.143  1.648  5.148 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.4499     0.1974   -7.35  8.4e-13 ***
## x1            0.9707     0.0339   28.65  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.11 on 498 degrees of freedom
## Multiple R-squared:  0.622,  Adjusted R-squared:  0.622 
## F-statistic:  821 on 1 and 498 DF,  p-value: <2e-16
sm1$sigma
## [1] 2.113

This value is the square root of the residual sum of squares divided by the residual degrees of freedom:

sqrt(sum(residuals(m1)^2)/(m1$df.residual))
## [1] 2.113

In other words, the value is proportional to the standard deviation of the model residuals. In large samples, it will converge on the standard deviation of the residuals:

sd(residuals(m1))
## [1] 2.111

We can also see it in the multivariate model m2:

sm2$sigma
## [1] 0.9949
sqrt(sum(residuals(m2)^2)/(m2$df.residual))
## [1] 0.9949

Because sigma is a standard deviation (and not a variance), it is on the scale of the original outcome data. Thus, we can actually directly compare the standard deviation of the original outcome data sd(y) to the sigma of any model attempting to account for the variation in y. We see that our models reduce that standard deviation considerably:

sd(y)
## [1] 3.435
sm1$sigma
## [1] 2.113
sm2$sigma
## [1] 0.9949

Because of this inherent comparability of scale, sigma provides a much nicer measure of model fit than R-squared. It can be difficult to interpret how much better a given model fits compared to a baseline model when using R-squared (as we saw just above). By contrast, we can easily quantify the extra explanation done by a larger model by looking at sigma. We can, for example, see that the addition of several random, unrelated variables in our model does almost nothing to sigma:

sm1$sigma
## [1] 2.113
summary(lm(y ~ x1 + tmp1 + tmp2 + tmp2 + tmp4))$sigma
## [1] 2.101

Formal model comparison

While we can see in the sigma values that the multivariate model fit the outcome data better than the bivariate model, the statistics alone don't supply a formal test of that between-model comparison. Typically, such comparisons are - incorrectly - made by comparing the R-squared of different models. Such comparisons are problematic because of the sensitivity of R-squared to the number of included variables (in addition to model fit). We can make a formal comparison between “nested” models (i.e., models with a common outcome, where one contains a subset of the covariates of the other model and no additional covariates). To do so, we conduct an F-test. The F-test compares the fit of the larger model to the smaller model. By definition, the larger model (i.e., the one with more predictors) will always fit the data better than the smaller model. Thus, just as with R-squared, we can be tempted to add more covariates simply to increase model fit even if that increase in fit is not particularly meaningful. The F-test thus compares the residuals from the larger model to the smaller model and tests whether there is a statistically significant reduction in the sum of squared residuals. We execute the test using the anova function:

anova(m1, m2)
## Analysis of Variance Table
## 
## Model 1: y ~ x1
## Model 2: y ~ x1 + x2 + x3
##   Res.Df  RSS Df Sum of Sq   F Pr(>F)    
## 1    498 2224                            
## 2    496  491  2      1733 876 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output suggests that our multivariate model does a much better job of fitting the data than the bivariate model, as indicated by the much larger RSS value for m1 (listed first) and the very large F-statistic (and associated very small, heavily-starred p-value). To see what's going on “under the hood,” we can calculate the RSS for both models, the difference in their sums of squares, and the associated F-statistic:

sum(m1$residuals^2)                      # residual sum of squares for m1
## [1] 2224
sum(m2$residuals^2)                      # residual sum of squares for m2
## [1] 490.9
sum(m1$residuals^2)-sum(m2$residuals^2)  # sum of squares
## [1] 1733
# the F-statistic
f <- ((sum(m1$residuals^2)-sum(m2$residuals^2))/sum(m2$residuals^2)) * # ratio of sum of squares
    (m2$df.residual/(m1$df.residual-m2$df.residual)) # ratio of degrees of freedom
f
## [1] 875.5
# p-value (from the F distribution)
pf(f, m1$df.residual-m2$df.residual, m2$df.residual, lower.tail = FALSE)
## [1] 1.916e-163

Thus there are some complicated calculations being performed in order for the anova F-test to tell us whether the models differ in their fit. But, those calculations give us very clear inference about any improvement in model fit. Using the F-test rather than an ad-hoc comparison between (adjusted) R-squared values is a much more appropriate comparison of model fits.

Quantile-Quantile (QQ) plots

Another very nice way to assess goodness of fit is to do so visually using the QQ-plot. This plot, in general, compares the distributions of two variables. In the regression context, we can use it to compare the quantiles of the outcome distribution to the quantiles of the distribution of fitted values from the model. To do this in R, we extract the fitted values from our model using the fitted function and use the qqplot function to do the plotting. Let's compare the fit of our bivariate model to our multivariate model using two side-by-side qqplots.

layout(matrix(1:2, nrow = 1))
qqplot(y, fitted(m1), col = "gray", pch = 15, ylim = c(-5, 10), main = "Bivariate model")
curve((x), col = "blue", add = TRUE)
qqplot(y, fitted(m2), col = "gray", pch = 15, ylim = c(-5, 10), main = "Multivariate model")
curve((x), col = "blue", add = TRUE)

plot of chunk unnamed-chunk-18

Note: The blue lines represent a y=x line. If the distribution of y and the distribution of the fitted values matched perfectly, then the gray dots would line up perfectly along the y=x line. We see, however, in the bivariate (underspecified) model (left panel) that the fitted values diverge considerably from the distribution of y. By contrast, the fitted values from our multivariate model (right panel) match the distribution of y much more closely. In both plots, however, the models clearly fail to precisely explain extreme values of y. While we cannot summarize the QQ-plot as a single numeric statistic, it provides a very rich characterization of fit that shows not only how well our model fits overall, but also where in the distribution of our outcome the model is doing a better or worse job of explaining the outcome.

Though different from a QQ-plot, we can also plot our fitted values directly against the outcome in order to see how well the model is capturing variation in the outcome. The closer this cloud of points looks to a single, straight line, the better the model fit. Such plots can also help us detect non-linearities and other patterns in the data. Let's compare the fit of the two models side-by-side again:

layout(matrix(1:2, nrow = 1))
plot(y, fitted(m1), col = "gray", pch = 15, ylim = c(-5, 10), main = "Bivariate model")
curve((x), col = "blue", add = TRUE)
plot(y, fitted(m2), col = "gray", pch = 15, ylim = c(-5, 10), main = "Multivariate model")
curve((x), col = "blue", add = TRUE)

plot of chunk unnamed-chunk-19

As above, we see that the bivariate model does a particularly poor job of explaining extreme cases in y, whereas the multivariate model does much better but remains imperfect (due to the random variation in y we introduced when creating the data).

OLS in matrix form

The matrix representation of the OLS estimator is b = (X'X)^(-1) X'Y. Representing this in R is simple. Let's start with some made up data:

set.seed(1)
n <- 20
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)
y <- x1 + x2 + x3 + rnorm(n)

To transpose a matrix, we use the t function:

X
##             x1       x2      x3
##  [1,] -0.62645  0.91898 -0.1645
##  [2,]  0.18364  0.78214 -0.2534
##  [3,] -0.83563  0.07456  0.6970
##  [4,]  1.59528 -1.98935  0.5567
##  [5,]  0.32951  0.61983 -0.6888
##  [6,] -0.82047 -0.05613 -0.7075
##  [7,]  0.48743 -0.15580  0.3646
##  [8,]  0.73832 -1.47075  0.7685
##  [9,]  0.57578 -0.47815 -0.1123
## [10,] -0.30539  0.41794  0.8811
## [11,]  1.51178  1.35868  0.3981
## [12,]  0.38984 -0.10279 -0.6120
## [13,] -0.62124  0.38767  0.3411
## [14,] -2.21470 -0.05381 -1.1294
## [15,]  1.12493 -1.37706  1.4330
## [16,] -0.04493 -0.41499  1.9804
## [17,] -0.01619 -0.39429 -0.3672
## [18,]  0.94384 -0.05931 -1.0441
## [19,]  0.82122  1.10003  0.5697
## [20,]  0.59390  0.76318 -0.1351
t(X)
##       [,1]    [,2]     [,3]    [,4]    [,5]     [,6]    [,7]    [,8]
## x1 -0.6265  0.1836 -0.83563  1.5953  0.3295 -0.82047  0.4874  0.7383
## x2  0.9190  0.7821  0.07456 -1.9894  0.6198 -0.05613 -0.1558 -1.4708
## x3 -0.1645 -0.2534  0.69696  0.5567 -0.6888 -0.70750  0.3646  0.7685
##       [,9]   [,10]  [,11]   [,12]   [,13]    [,14]  [,15]    [,16]
## x1  0.5758 -0.3054 1.5118  0.3898 -0.6212 -2.21470  1.125 -0.04493
## x2 -0.4782  0.4179 1.3587 -0.1028  0.3877 -0.05381 -1.377 -0.41499
## x3 -0.1123  0.8811 0.3981 -0.6120  0.3411 -1.12936  1.433  1.98040
##       [,17]    [,18]  [,19]   [,20]
## x1 -0.01619  0.94384 0.8212  0.5939
## x2 -0.39429 -0.05931 1.1000  0.7632
## x3 -0.36722 -1.04413 0.5697 -0.1351

To multiply two matrices, we use the %*% matrix multiplication operator:

t(X) %*% X
##        x1     x2     x3
## x1 16.573 -3.314  4.711
## x2 -3.314 14.427 -3.825
## x3  4.711 -3.825 12.843

To invert a matrix, we use the solve function:

solve(t(X) %*% X)
##           x1       x2       x3
## x1  0.068634 0.009868 -0.02224
## x2  0.009868 0.076676  0.01922
## x3 -0.022236 0.019218  0.09175

Now let's put all of that together:

solve(t(X) %*% X) %*% t(X) %*% y
##      [,1]
## x1 0.7818
## x2 1.2857
## x3 1.4615

Now let's compare it to the lm function:

lm(y ~ x1 + x2 + x3)$coef
## (Intercept)          x1          x2          x3 
##     0.08633     0.76465     1.27869     1.44705

The numbers are close, but they're not quite right. The reason is that we forgot to include the intercept in our matrix calculation. If we use lm again but leave out the intercept, we'll see this is the case:

lm(y ~ 0 + x1 + x2 + x3)$coef
##     x1     x2     x3 
## 0.7818 1.2857 1.4615

To include the intercept in matrix form, we need to add a vector of 1's to the matrix:

X2 <- cbind(1, X)  # this uses vector recycling

Now we redo our math:

solve(t(X2) %*% X2) %*% t(X2) %*% y
##       [,1]
##    0.08633
## x1 0.76465
## x2 1.27869
## x3 1.44705

And compare to our full model using lm:

lm(y ~ x1 + x2 + x3)$coef
## (Intercept)          x1          x2          x3 
##     0.08633     0.76465     1.27869     1.44705

The result is exactly what we would expect.
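
As an optional extension (a sketch beyond the comparison above), the same matrix pieces also give the coefficient standard errors via the usual estimator s^2 (X'X)^(-1):

b <- solve(t(X2) %*% X2) %*% t(X2) %*% y   # coefficients, as computed above
e <- y - X2 %*% b                          # residuals
s2 <- sum(e^2)/(nrow(X2) - ncol(X2))       # estimated error variance
sqrt(diag(s2 * solve(t(X2) %*% X2)))       # standard errors
# compare to: coef(summary(lm(y ~ x1 + x2 + x3)))[, 2]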

OLS interaction plots

Interactions are important, but they're hard to understand without visualization. This script works through how to visualize interactions in linear regression models.

Plots for identifying interactions

set.seed(1)
x1 <- rnorm(200)
x2 <- rbinom(200, 1, 0.5)
y <- x1 + x2 + (2 * x1 * x2) + rnorm(200)

Interactions (at least in fake data) tend to produce weird plots:

plot(y ~ x1)

plot of chunk unnamed-chunk-2

plot(y ~ x2)

plot of chunk unnamed-chunk-2

This means they also produce weird residual plots:

ols1 <- lm(y ~ x1 + x2)
plot(ols1$residuals ~ x1)

plot of chunk unnamed-chunk-3

plot(ols1$residuals ~ x2)

plot of chunk unnamed-chunk-3

For example, in the first residual plot we can clearly see two distinct relationships between the residuals and x1, one positive and one negative. We thus want to model this using an interaction:

ols2 <- lm(y ~ x1 + x2 + x1:x2)
summary(ols2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.855 -0.680 -0.002  0.682  3.769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0627     0.1103   -0.57     0.57    
## x1            1.1199     0.1258    8.90  3.7e-16 ***
## x2            1.1303     0.1538    7.35  5.3e-12 ***
## x1:x2         1.9017     0.1672   11.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  299 on 3 and 196 DF,  p-value: <2e-16

Note: This is equivalent to either of the following:

summary(lm(y ~ x1 + x2 + x1 * x2))
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1 * x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.855 -0.680 -0.002  0.682  3.769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0627     0.1103   -0.57     0.57    
## x1            1.1199     0.1258    8.90  3.7e-16 ***
## x2            1.1303     0.1538    7.35  5.3e-12 ***
## x1:x2         1.9017     0.1672   11.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  299 on 3 and 196 DF,  p-value: <2e-16
summary(lm(y ~ x1 * x2))
## 
## Call:
## lm(formula = y ~ x1 * x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.855 -0.680 -0.002  0.682  3.769 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0627     0.1103   -0.57     0.57    
## x1            1.1199     0.1258    8.90  3.7e-16 ***
## x2            1.1303     0.1538    7.35  5.3e-12 ***
## x1:x2         1.9017     0.1672   11.37  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 196 degrees of freedom
## Multiple R-squared:  0.821,  Adjusted R-squared:  0.818 
## F-statistic:  299 on 3 and 196 DF,  p-value: <2e-16

However, specifying only the interaction…

summary(lm(y ~ x1:x2))
## 
## Call:
## lm(formula = y ~ x1:x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.390 -0.873  0.001  0.971  4.339 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.5305     0.0988    5.37  2.2e-07 ***
## x1:x2         3.0492     0.1415   21.55  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.4 on 198 degrees of freedom
## Multiple R-squared:  0.701,  Adjusted R-squared:   0.7 
## F-statistic:  464 on 1 and 198 DF,  p-value: <2e-16

produces an incomplete (and thus invalid) model. Now let's figure out how to visualize this interaction based upon the complete/correct model.

Start with the raw data

Our example data are particularly simple. There are two groups (defined by x2) and one covariate (x1). We can plot these two groups separately in order to see their distributions of y as a function of x1. We can index our vectors in order to plot the groups separately in red and blue:

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))

plot of chunk unnamed-chunk-7

It is already clear that there is an interaction. Let's see what happens when we plot the estimated effects.

Predicted outcomes

The easiest way of examining interactions is with predicted outcome plots. We simply want to show the predicted value of the outcome for combinations of the input variables. We know that we can do this with the predict function applied to some new data. The expand.grid function is helpful for building the necessary new data:

xseq <- seq(-5, 5, length.out = 100)
newdata <- expand.grid(x1 = xseq, x2 = c(0, 1))

Let's build a set of predicted values for our no-interaction model:

fit1 <- predict(ols1, newdata, se.fit = TRUE, type = "response")

Then do the same for our full model with the interaction:

fit2 <- predict(ols2, newdata, se.fit = TRUE, type = "response")

Now let's plot the original data, again. Then we'll overlay it with the predicted values for the two groups.

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit1$fit[1:100], type = "l", col = "red")
points(xseq, fit1$fit[101:200], type = "l", col = "blue")

plot of chunk unnamed-chunk-11

The result is a plot that differentiates the absolute levels of y in the two groups, but forces them to have equivalent slopes. We know this is wrong.

Now let's try to plot the data with the correct fitted values, accounting for the interaction:

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")

plot of chunk unnamed-chunk-12

This looks better. The fitted value lines correspond nicely to the varying slopes in our two groups. But we still need to add uncertainty. Luckily, we have the necessary information in fit2$se.fit.

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")
points(xseq, fit2$fit[1:100] - fit2$se.fit[1:100], type = "l", col = "red", 
    lty = 2)
points(xseq, fit2$fit[1:100] + fit2$se.fit[1:100], type = "l", col = "red", 
    lty = 2)
points(xseq, fit2$fit[101:200] - fit2$se.fit[101:200], type = "l", col = "blue", 
    lty = 2)
points(xseq, fit2$fit[101:200] + fit2$se.fit[101:200], type = "l", col = "blue", 
    lty = 2)

plot of chunk unnamed-chunk-13

We can also produce the same plot through bootstrapping.

tmpdata <- data.frame(x1 = x1, x2 = x2, y = y)
myboot <- function() {
    thisboot <- sample(1:nrow(tmpdata), nrow(tmpdata), TRUE)
    coef(lm(y ~ x1 * x2, data = tmpdata[thisboot, ]))
}
bootcoefs <- replicate(2500, myboot())
plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
apply(bootcoefs, 2, function(coefvec) {
    points(xseq, coefvec[1] + (xseq * coefvec[2]), type = "l", col = rgb(1, 
        0, 0, 0.01))
    points(xseq, coefvec[1] + (xseq * (coefvec[2] + coefvec[4])) + coefvec[3], 
        type = "l", col = rgb(0, 0, 1, 0.01))

})
## NULL
points(xseq, fit2$fit[1:100], type = "l")
points(xseq, fit2$fit[101:200], type = "l")
points(xseq, fit2$fit[1:100] - fit2$se.fit[1:100], type = "l", lty = 2)
points(xseq, fit2$fit[1:100] + fit2$se.fit[1:100], type = "l", lty = 2)
points(xseq, fit2$fit[101:200] - fit2$se.fit[101:200], type = "l", lty = 2)
points(xseq, fit2$fit[101:200] + fit2$se.fit[101:200], type = "l", lty = 2)

plot of chunk unnamed-chunk-14

If we overlay our previous lines on top of this, we see that they produce the same result as above.

Of course, we may want to show confidence intervals rather than SEs. And this is simple. We can reproduce the graph with 95% confidence intervals, using qnorm to determine how much to multiply our SEs by.

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit2$fit[1:100], type = "l", col = "red")
points(xseq, fit2$fit[101:200], type = "l", col = "blue")
points(xseq, fit2$fit[1:100] - qnorm(0.975) * fit2$se.fit[1:100], type = "l", 
    lty = 2, col = "red")
points(xseq, fit2$fit[1:100] + qnorm(0.975) * fit2$se.fit[1:100], type = "l", 
    lty = 2, col = "red")
points(xseq, fit2$fit[101:200] - qnorm(0.975) * fit2$se.fit[101:200], type = "l", 
    lty = 2, col = "blue")
points(xseq, fit2$fit[101:200] + qnorm(0.975) * fit2$se.fit[101:200], type = "l", 
    lty = 2, col = "blue")

plot of chunk unnamed-chunk-15

Incorrect models (without constituent terms)

We can also use plots to visualize why we need to include constituent terms in our interaction models. Recall that our model is defined as:

ols2 <- lm(y ~ x1 + x2 + x1:x2)

We can compare this to a model with only one term and the interaction:

ols3 <- lm(y ~ x1 + x1:x2)
fit3 <- predict(ols3, newdata, se.fit = TRUE, type = "response")

And plot its results:

plot(x1[x2 == 0], y[x2 == 0], col = rgb(1, 0, 0, 0.5), xlim = c(min(x1), max(x1)), 
    ylim = c(min(y), max(y)))
points(x1[x2 == 1], y[x2 == 1], col = rgb(0, 0, 1, 0.5))
points(xseq, fit3$fit[1:100], type = "l", col = "red")
points(xseq, fit3$fit[101:200], type = "l", col = "blue")
# We can compare these lines to those from the full model:
points(xseq, fit2$fit[1:100], type = "l", col = "red", lwd = 2)
points(xseq, fit2$fit[101:200], type = "l", col = "blue", lwd = 2)

plot of chunk unnamed-chunk-18

By leaving out a term, we misestimate the effect of x1 in both groups.

Ordered Outcome Models

This tutorial focuses on ordered outcome regression models. R's base glm function does not support these, but they're very easy to execute using the MASS package, which is a recommended package.

library(MASS)

Ordered outcome data always create a bit of tension when it comes to analysis because they present many options for how to proceed. For example, imagine we are looking at the effect of some independent variables on response to a survey question that measures opinion on a five-point scale from extremely supportive to extremely opposed. We could dichotomize the measure to compare support versus opposition with a binary model. We could also assume that the categories are spaced equidistant on a latent scale and simply model the outcome using a linear model. Or, finally, we could use an ordered model (e.g., ordered logit or ordered probit) to model the unobserved latent scale of the outcome without requiring that the outcome categories are equidistant on that scale. We'll focus on the last of these options here, with comparison to the binary and linear alternative specifications.

Let's start by creating some data that have a linear relationship between an outcome y and two covariates x1 and x2:

set.seed(500)
x1 <- runif(500, 0, 10)
x2 <- rbinom(500, 1, 0.5)
y <- x1 + x2 + rnorm(500, 0, 3)

The y vector is our latent linear scale that we won't actually observe. Instead let's collapse the y variable into a new variable y2, which will serve as our observed data and has 5 categories. We can do this using the cut function:

y2 <- as.numeric(cut(y, 5))
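
It can help to check how the observations fall across the five ordered categories before plotting:

table(y2)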

Now let's plot our “observed” data y2 against our independent variables. We'll plot the values for x2==1 and x2==0 separately just to visualize the data. And we'll additionally fit a linear model to the data and draw separate lines predicting y2 for values of x2==0 and x2==1 (which will be parallel lines):

lm1 <- lm(y2 ~ x1 + x2)
plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16)
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
abline(coef(lm1)[1], coef(lm1)[2], col = "red", lwd = 2)
abline(coef(lm1)[1] + coef(lm1)[3], coef(lm1)[2], col = "blue", lwd = 2)

plot of chunk unnamed-chunk-4

The plot actually seems like a decent fit, but let's remember that the linear model is trying to predict the conditional means of our outcome y2 at each value of the covariates, and those conditional means can be fairly meaningless when our outcome can only take specific values rather than any value. Let's redraw the plot with points for the conditional means (at 10 values of x1) to see the problem:

plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16)
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
x1cut <- as.numeric(cut(x1, 10))
s <- sapply(unique(x1cut), function(i) {
    points(i, mean(y2[x1cut == i & x2 == 0]), col = "red", pch = 15)
    points(i, mean(y2[x1cut == i & x2 == 1]), col = "blue", pch = 15)
})
# redraw the regression lines:
abline(coef(lm1)[1], coef(lm1)[2], col = "red", lwd = 1)
abline(coef(lm1)[1] + coef(lm1)[3], coef(lm1)[2], col = "blue", lwd = 1)

plot of chunk unnamed-chunk-5

Estimating ordered logit and probit models

Overall, then, the previous approach doesn't seem to be doing a great job, and the output of the model will be continuous values that fall outside the set of discrete values we actually observed for y2. Instead, we should try an ordered model (either ordered logit or ordered probit). To estimate these models we need to use the polr function from the MASS package. We can use the same formula interface that we used for the linear model. The default is an ordered logit model, but we can easily request a probit using a method='probit' argument. Note: One important issue is that the outcome needs to be a “factor” class object, but we can do the conversion directly in the call to polr:

ologit <- polr(factor(y2) ~ x1 + x2)
oprobit <- polr(factor(y2) ~ x1 + x2, method = "probit")

Let's look at the summaries of these objects, just to get familiar with the output:

summary(ologit)
## 
## Re-fitting to get Hessian
## Error: object 'y2' not found
summary(oprobit)
## 
## Re-fitting to get Hessian
## Error: object 'y2' not found

The output looks similar to a linear model but now instead of a single intercept, we have a set of intercepts listed separately from the other coefficients. These intercepts speak to the points (on a latent dimension) where the outcome transitions from one category to the next. Because they're on a latent scale, they're not particularly meaningful to us. Indeed, even the coefficients aren't particularly meaningful. Unlike in OLS, these are not directly interpretable. So let's instead look at some predicted probabilities.
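
If we do want to inspect the two pieces separately (a small optional sketch using the ologit object estimated above), coef returns the slope coefficients and the zeta component holds the estimated cutpoints:

coef(ologit)  # slope coefficients for x1 and x2
ologit$zeta   # estimated cutpoints ('intercepts') on the latent scale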

Predicted outcomes for ordered models

Predicted probabilities can be estimated in the same way for ordered models as for binary GLMs. We simply need to create some covariate data over which we want to estimate predicted probabilities and then run predict. We'll use expand.grid to create our newdata dataframe because we have two covariates and it simplifies creating data at each possible level of both variables.

newdata <- expand.grid(seq(0, 10, length.out = 100), 0:1)
names(newdata) <- c("x1", "x2")

When estimating outcomes we can actually choose between getting the discrete fitted class (i.e., which value of the outcome is most likely at each value of covariates) or the predicted probabilities. We'll get both for the logit model just to compare:

plogclass <- predict(ologit, newdata, type = "class")
plogprobs <- predict(ologit, newdata, type = "probs")

If we look at the head of each object, we'll see that when type='class', the result is a single vector of discrete fitted values, whereas when type='probs', the response is a matrix where (for each observation in our new data) the predicted probability of being in each outcome category is specified.

head(plogclass)
## [1] 2 2 2 2 2 2
## Levels: 1 2 3 4 5
head(plogprobs)
##        1      2       3        4         5
## 1 0.3564 0.5585 0.07889 0.005966 0.0002858
## 2 0.3428 0.5673 0.08326 0.006330 0.0003034
## 3 0.3295 0.5756 0.08785 0.006715 0.0003220
## 4 0.3165 0.5833 0.09267 0.007124 0.0003418
## 5 0.3038 0.5906 0.09771 0.007558 0.0003627
## 6 0.2913 0.5973 0.10299 0.008018 0.0003850

Note: The predicted probabilities necessarily sum to 1 in ordered models:

rowSums(plogprobs)
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 199 200 
##   1   1

The easiest way to make sense of these predictions is through plotting. Let's start by plotting the original data and then overlaying, as horizontal lines, the predicted classes for each value of x1 and x2:

plot(y2[x2 == 0] ~ x1[x2 == 0], col = rgb(1, 0, 0, 0.2), pch = 16, xlab = "x1", 
    ylab = "y")
points(y2[x2 == 1] ~ x1[x2 == 1], col = rgb(0, 0, 1, 0.2), pch = 16)
s <- sapply(1:5, function(i) lines(newdata$x1[plogclass == i & newdata$x2 == 
    0], as.numeric(plogclass)[plogclass == i & newdata$x2 == 0] + 0.1, col = "red", 
    lwd = 3))
s <- sapply(1:5, function(i) lines(newdata$x1[plogclass == i & newdata$x2 == 
    1], as.numeric(plogclass)[plogclass == i & newdata$x2 == 1] - 0.1, col = "blue", 
    lwd = 3))

plot of chunk unnamed-chunk-12

Note: We've drawn the predicted classes separately for x2==0 (red) and x2==1 (blue) and offset them vertically to see their values and the underlying data. The above plot shows, for each combination of values of x1 and x2, what the most likely category to observe for y2 is. Thus, where one horizontal bar ends, the next begins (i.e., the blue bars do not overlap each other and neither do the red bars). You'll also note for these data that the predictions are never expected to be in y==1 or y==5, even though some of our observed y values are.
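
We can confirm this quickly (an optional check) by cross-tabulating the fitted classes against x2; categories 1 and 5 should receive no predictions:

table(plogclass, newdata$x2)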

Predicted probabilities for ordered models

Now that we've seen the fitted classes, we should acknowledge that we have some uncertainty about those classes. The fitted class is simply the most likely class for each observation, but there is a well-defined probability of observing each of the other y classes as well. We can see this if we plot our predicted probability object plogprobs. We'll plot predicted probabilities when x2==0 on the left and when x2==1 on the right. The colored lines represent the predicted probability of falling in each category of y2 (in rainbow order, so that red represents y2==1 and purple represents y2==5). We'll draw a thick horizontal line at the bottom of the plot representing the predicted classes at each value of x1 and x2:

layout(matrix(1:2, nrow = 1))
# plot for `x2==0`
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], plogprobs[newdata$x2 == 
    0, i], lwd = 1, col = col), 1:5, rainbow(5))
# optional horizontal line representing predicted class
s <- mapply(function(i, col) lines(newdata$x1[plogclass == i & newdata$x2 == 
    0], rep(0, length(newdata$x1[plogclass == i & newdata$x2 == 0])), col = col, 
    lwd = 3), 1:5, rainbow(5))
# plot for `x2==1`
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], plogprobs[newdata$x2 == 
    1, i], lwd = 1, col = col), 1:5, rainbow(5))
# optional horizontal line representing predicted class
s <- mapply(function(i, col) lines(newdata$x1[plogclass == i & newdata$x2 == 
    1], rep(0, length(newdata$x1[plogclass == i & newdata$x2 == 1])), col = col, 
    lwd = 3), 1:5, rainbow(5))

plot of chunk unnamed-chunk-13

We can see that the predicted probability curves strictly follow the logistic distribution (due to our use of a logit model). The lefthand plot also shows what we noted in the earlier plot: when x2==0, the model never predicts y2==5.

Note: We can redraw the same plot using prediction values from our ordered probit model and obtain essentially the same inference:

oprobprobs <- predict(oprobit, newdata, type = "probs")
layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Predicted Probability (Probit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], oprobprobs[newdata$x2 == 
    0, i], lwd = 1, col = col), 1:5, rainbow(5))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Predicted Probability (Probit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], oprobprobs[newdata$x2 == 
    1, i], lwd = 1, col = col), 1:5, rainbow(5))

plot of chunk unnamed-chunk-14

Alternative predicted probability plot

The predicted probability plots above communicate a lot of information, but we can also present predicted probabilities in a different way. Because we use ordered outcome regression models when we believe the outcome has a meaningful ordinal scale, it may make sense to present the predicted probabilities stacked on top of one another as a “stacked area chart” (since they sum to 1 for every combination of covariates), which differently communicates the relative probability of being in each outcome class at each combination of covariates. To do this, we need to write a little bit of code to prep our data. Specifically, our plogprobs object is a matrix where, for each row, the columns are predicted probabilities of being in each category of the outcome. In order to plot them stacked on top of one another, we need the value in each column to instead be the cumulative probability (calculated left-to-right across the matrix). Luckily, R has some nice built-in functions to do this. cumsum returns the cumulative sum at each position of a vector. We can use apply to calculate this cumulative sum for each row of the plogprobs matrix, and then we simply need to transpose the result using the t function to simplify some things later on. Let's try it out:

cumprobs <- t(apply(plogprobs, 1, cumsum))
head(cumprobs)
##        1      2      3      4 5
## 1 0.3564 0.9149 0.9937 0.9997 1
## 2 0.3428 0.9101 0.9934 0.9997 1
## 3 0.3295 0.9051 0.9930 0.9997 1
## 4 0.3165 0.8999 0.9925 0.9997 1
## 5 0.3038 0.8944 0.9921 0.9996 1
## 6 0.2913 0.8886 0.9916 0.9996 1

Note: The cumulative probabilities will always be 1 for category 5 because rows sum to 1. To plot this, we simply need to draw these new values on our plot. We'll again separate data for x2==0 from x2==1.

layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Cumulative Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 0], cumprobs[newdata$x2 == 
    0, i], lwd = 1, col = col), 1:5, rainbow(5))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Cumulative Predicted Probability (Logit)")
s <- mapply(function(i, col) lines(newdata$x1[newdata$x2 == 1], cumprobs[newdata$x2 == 
    1, i], lwd = 1, col = col), 1:5, rainbow(5))

plot of chunk unnamed-chunk-16

The result is a stacked area chart showing the cumulative probability of being in a set of categories of y. If we think back to the first example at the top of this tutorial - about predicting opinions on a five-point scale - we could interpret the above plot as the cumulative probability of, e.g., opposing the issue. If y==1 (red) and y==2 (yellow) represent strong and weak opposition, respectively, we could interpret the lefthand plot as saying that when x1==0, there is about a 40% chance that an individual strongly opposes and an over 90% chance that they oppose strongly or weakly. This plot makes it somewhat more difficult to figure out what the most likely outcome category is, but it helps for making these kinds of cumulative prediction statements. To see the most likely category, we have to visually estimate the widest vertical distance between lines at any given value of x1, which can be tricky.
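
We can read those numbers directly out of cumprobs; as shown in the head above, the first row corresponds to x1==0 with x2==0:

cumprobs[1, ]  # cumulative predicted probabilities at x1==0, x2==0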

We can also use the polygon plotting function to draw areas rather than lines, which produces a slightly different effect:

layout(matrix(1:2, nrow = 1))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==0)", ylab = "Cumulative Predicted Probability (Logit)", 
    bty = "l")
s <- mapply(function(i, col) polygon(c(newdata$x1[newdata$x2 == 0], rev(newdata$x1[newdata$x2 == 
    0])), c(cumprobs[newdata$x2 == 0, i], rep(0, length(newdata$x1[newdata$x2 == 
    0]))), lwd = 1, col = col, border = col), 5:1, rev(rainbow(5)))
plot(NA, xlim = c(min(x1), max(x1)), ylim = c(0, 1), xlab = "x1 (x2==1)", ylab = "Cumulative Predicted Probability (Logit)", 
    bty = "l")
s <- mapply(function(i, col) polygon(c(newdata$x1[newdata$x2 == 1], rev(newdata$x1[newdata$x2 == 
    1])), c(cumprobs[newdata$x2 == 1, i], rep(0, length(newdata$x1[newdata$x2 == 
    1]))), lwd = 1, col = col, border = col), 5:1, rev(rainbow(5)))

plot of chunk unnamed-chunk-17

Note: We draw the polygons in reverse order so that the lower curves are drawn on top of the higher curves.

Permutation Tests

An increasingly common statistical tool for constructing sampling distributions is the permutation test (sometimes called a randomization test). Like bootstrapping, a permutation test builds - rather than assumes - a sampling distribution (called the “permutation distribution”) by resampling the observed data. Specifically, we can “shuffle” or permute the observed data (e.g., by assigning different outcome values to each observation from among the set of actually observed outcomes). Unlike bootstrapping, we do this without replacement.

Permutation tests are particularly relevant in experimental studies, where we are often interested in the sharp null hypothesis of no difference between treatment groups. In these situations, the permutation test perfectly represents our process of inference because our null hypothesis is that the two treatment groups do not differ on the outcome (i.e., that the outcome is observed independently of treatment assignment). When we permute the outcome values during the test, we therefore see all of the possible alternative treatment assignments we could have had and where the mean-difference in our observed data falls relative to all of the differences we could have seen if the outcome were independent of treatment assignment. While a full permutation test requires that we see all possible permutations of the data (which can become quite numerous), we can easily conduct “approximate permutation tests” by simply conducting a very large number of resamples. That process should, in expectation, approximate the permutation distribution.

For example, if we have only n=20 units in our study, the number of permutations is:

factorial(20)
## [1] 2.433e+18

That number exceeds what we can reasonably compute. But we can randomly sample from that permutation distribution to obtain the approximate permutation distribution, simply by running a large number of resamples. Let's look at this as an example using some made up data:

set.seed(1)
n <- 100
tr <- rbinom(100, 1, 0.5)
y <- 1 + tr + rnorm(n, 0, 3)

The difference in means is, as we would expect (given we made it up), about 1:

diff(by(y, tr, mean))
## [1] 1.341

To obtain a single permutation of the data, we simply resample without replacement and calculate the difference again:

s <- sample(tr, length(tr), FALSE)
diff(by(y, s, mean))
## [1] -0.2612

Here we use the permuted treatment vector s instead of tr to calculate the difference and find a very small difference. If we repeat this process a large number of times, we can build our approximate permutation distribution (i.e., the sampling distribution for the mean-difference). We'll use replicate to repeat our permutation process. The result will be a vector of the differences from each permutation (i.e., our distribution):

dist <- replicate(2000, diff(by(y, sample(tr, length(tr), FALSE), mean)))

We can look at our distribution using hist and draw a vertical line for our observed difference:

hist(dist, xlim = c(-3, 3), col = "black", breaks = 100)
abline(v = diff(by(y, tr, mean)), col = "blue", lwd = 2)

plot of chunk unnamed-chunk-6

At face value, it seems that our null hypothesis can probably be rejected. Our observed mean-difference appears to be quite extreme in terms of the distribution of possible mean-differences observable were the outcome independent of treatment assignment. But we can use the distribution to obtain a p-value for our mean-difference by counting how many permuted mean-differences are larger than the one we observed in our actual data. We can then divide this by the number of items in our permutation distribution (i.e., 2000 from our call to replicate, above):

sum(dist > diff(by(y, tr, mean)))/2000  # one-tailed test
## [1] 0.009
sum(abs(dist) > abs(diff(by(y, tr, mean))))/2000  # two-tailed test
## [1] 0.018

Using either the one-tailed test or the two-tailed test, our difference is unlikely to be due to chance variation observable in a world where the outcome is independent of treatment assignment.

library(coin)

We don't always need to build our own permutation distributions (though it is good to know how to do it). R provides a package to conduct permutation tests called coin. We can compare our p-value (and associated inference) from above with the result from coin:

library(coin)
independence_test(y ~ tr, alternative = "greater")  # one-tailed
## 
##  Asymptotic General Independence Test
## 
## data:  y by tr
## Z = 2.315, p-value = 0.01029
## alternative hypothesis: greater
independence_test(y ~ tr)  # two-tailed
## 
##  Asymptotic General Independence Test
## 
## data:  y by tr
## Z = 2.315, p-value = 0.02059
## alternative hypothesis: two.sided

Clearly, our approximate permutation distribution provided the same inference and a nearly identical p-value. coin provides other permutation tests for different kinds of comparisons, as well. Almost anything that you can address in a parametric framework can also be done in a permutation framework (if substantively appropriate), and anything that coin doesn't provide, you can build by hand with the basic permutation logic of resampling.
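
As one example (a sketch; see ?independence_test for the full set of options), the distribution argument lets coin build a Monte Carlo permutation distribution much like the one we constructed by hand, rather than relying on the asymptotic approximation:

independence_test(y ~ tr, alternative = "greater", distribution = "approximate")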

Plots as data summary

While we can use tables and statistics to summarize data, it is often useful to summarize data visually. This script describes how to produce some common summary plots.

Histogram

The simplest plot is a histogram, which shows the frequencies of different values in a distribution. Drawing a basic histogram in R is easy. First, let's generate a random vector:

set.seed(1)
a <- rnorm(30)

Then we can draw a histogram of the data:

hist(a)

plot of chunk unnamed-chunk-2

This isn't the most attractive plot, though, and we can easily make it look different:

hist(a, col = "gray20", border = "lightgray")

plot of chunk unnamed-chunk-3

Density plot

Another approach to summarizing the distribution of a variable is a density plot. This visualization is basically a “smoothed” histogram and it's easy to plot, too.

plot(density(a))

plot of chunk unnamed-chunk-4

Clearly, the two plots give us similar information. We can even overlay them. Doing so requires a few modifications to our code, though.

hist(a, freq = FALSE, col = "gray20", border = "lightgray")
lines(density(a), col = "red", lwd = 2)

plot of chunk unnamed-chunk-5

Barplot

One of the simplest data summaries is a barplot. Like a histogram, it shows bars, but those bars represent statistics rather than just counts (though they could be counts). We can make a barplot from a vector of numeric values:

b <- c(3, 4.5, 5, 8, 3, 6)
barplot(b)

plot of chunk unnamed-chunk-6

The result is something visually very similar to the histogram. We can easily label the bars by specifying a names.arg parameter:

barplot(b, names.arg = letters[1:6])

plot of chunk unnamed-chunk-7

We can also turn the plot on its side, if that looks better:

barplot(b, names.arg = letters[1:6], horiz = TRUE)

plot of chunk unnamed-chunk-8

We can also create a stacked barplot by providing a matrix rather than a vector of input data. Let's say we have counts of two types of objects (e.g., coins) from three groups:

d <- rbind(c(2, 4, 1), c(6, 1, 3))
d
##      [,1] [,2] [,3]
## [1,]    2    4    1
## [2,]    6    1    3
barplot(d, names.arg = letters[1:3])

plot of chunk unnamed-chunk-9

Instead of stacking the bars of each type, we can present them side by side using the beside parameter:

barplot(d, names.arg = letters[1:3], beside = TRUE)

plot of chunk unnamed-chunk-10

Dotchart

Rather than waste a lot of ink on bars, we can see the same kinds of relationships in dotcharts.

dotchart(b, labels = letters[1:6])

plot of chunk unnamed-chunk-11

As we can see, the barplot and the dotchart communicate the same information in more or less the same way:

layout(matrix(1:2, nrow = 1))
barplot(b, names.arg = letters[1:6], horiz = TRUE, las = 2)
dotchart(b, labels = letters[1:6], xlim = c(0, 8))

plot of chunk unnamed-chunk-12

Boxplot

It is often helpful to describe the distribution of data with a box plot. The boxplot summarizes any continuous vector of data by showing the five-number summary and any outliers:

boxplot(a)

plot of chunk unnamed-chunk-13

It can also compare distributions in two or more groups:

e <- rnorm(100, 1, 1)
f <- rnorm(100, 2, 4)
boxplot(e, f)

plot of chunk unnamed-chunk-14

We can also use a “formula” description of data if one of our variables describes which group our observations fall into:

g1 <- c(e, f)
g2 <- rep(c(1, 2), each = 100)
boxplot(g1 ~ g2)

plot of chunk unnamed-chunk-15

As we can see, these last two plots are identical. They're just different ways of telling boxplot what to plot.

Scatterplot

When we want to describe the relationships among variables, we often want a scatterplot.

x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- x1 + x2
x4 <- x1 + x3

We can draw a scatterplot in one of two ways. (1) Naming vectors as sequential arguments:

plot(x1, x2)

plot of chunk unnamed-chunk-17

(2) Using a “formula” interface:

plot(x2 ~ x1)

plot of chunk unnamed-chunk-18

We can plot the relationship between x1 and the three other variables:

layout(matrix(1:3, nrow = 1))
plot(x1, x2)
plot(x1, x3)
plot(x1, x4)

plot of chunk unnamed-chunk-19

We can also use the pairs function to do this for all relationships between all variables:

pairs(~x1 + x2 + x3 + x4)

plot of chunk unnamed-chunk-20

This allows us to visualize a lot of information very quickly.

Plotting regression summaries

The olsplots.r script walked through plotting regression diagnostics. Here we focus on plotting regression results.

Plotting regression slopes

Because the other script described plotting slopes to some extent, we'll start there. Once we have a regression model, it's incredibly easy to plot slopes using abline:

set.seed(1)
x1 <- rnorm(100)
y1 <- x1 + rnorm(100)
ols1 <- lm(y1 ~ x1)
plot(y1 ~ x1, col = "gray")

plot of chunk unnamed-chunk-1

Note: plot(y1~x1) is equivalent to plot(x1,y1), with reversed order of terms.

abline(coef(ols1)[1], coef(ols1)["x1"], col = "red")
## Error: plot.new has not been called yet

This is a nice plot, but it doesn't show uncertainty. To add uncertainty about our effect, let's try bootstrapping our standard errors.

To bootstrap, we resample our original data, reestimate the model, and redraw our line. We're going to do some functional programming to make this happen.

myboot <- function() {
    tmpdata <- data.frame(x1 = x1, y1 = y1)
    thisboot <- sample(1:nrow(tmpdata), nrow(tmpdata), TRUE)
    coef(lm(y1 ~ x1, data = tmpdata[thisboot, ]))
}
bootcoefs <- replicate(2500, myboot())

The result, bootcoefs, is a set of 2500 bootstrapped OLS estimates. We can add all of these to our plot using the apply function:

plot(y1 ~ x1, col = "gray")
apply(bootcoefs, 2, abline, col = rgb(1, 0, 0, 0.01))

plot of chunk unnamed-chunk-4

## NULL

The darkest parts of this plot show where we have the most certainty about our expected values. At the tails of the plot, because of the uncertainty about our slope, the range of plausible predicted values is greater.

We can also get a similar looking plot using mathematically calculated SEs. The predict function will help us determine the predicted values from a regression model at different inputs. To use it, we generate some new data representing the range of observed values of our data:

new1 <- data.frame(x1 = seq(-3, 3, length.out = 100))

We then do the prediction, specifying our model (ols1), the new data (new1), that we want SEs, and that we want “response” predictions.

pred1 <- predict(ols1, newdata = new1, se.fit = TRUE, type = "response")

We can then plot our data:

plot(y1 ~ x1, col = "gray")
# Add the predicted line of best fit (i.e., the regression line):
points(pred1$fit ~ new1$x1, type = "l", col = "blue")
# Note: This is equivalent to `abline(coef(ols1)[1], coef(ols1)[2],
# col='blue')` over the range (-3,3).  Then we add our confidence intervals:
lines(new1$x1, pred1$fit + (1.96 * pred1$se.fit), lty = 2, col = "blue")
lines(new1$x1, pred1$fit - (1.96 * pred1$se.fit), lty = 2, col = "blue")

plot of chunk unnamed-chunk-7

Note: The lty parameter means “line type.” We've requested a dotted line.

We can then compare the two approaches by plotting them together:

plot(y1 ~ x1, col = "gray")
apply(bootcoefs, 2, abline, col = rgb(1, 0, 0, 0.01))
## NULL
points(pred1$fit ~ new1$x1, type = "l", col = "blue")
lines(new1$x1, pred1$fit + (1.96 * pred1$se.fit), lty = 2, col = "blue")
lines(new1$x1, pred1$fit - (1.96 * pred1$se.fit), lty = 2, col = "blue")

plot of chunk unnamed-chunk-8

As should be clear, both give us essentially the same representation of uncertainty, but in stylistically different ways.

It is also possible to draw a shaded region rather than the blue lines in the above example. To do this we use the polygon function, to which we have to feed the x and y positions of the points:

plot(y1 ~ x1, col = "gray")
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))), 
    c(pred1$fit - (1.96 * pred1$se.fit), rev(pred1$fit + (1.96 * pred1$se.fit))), 
    col = rgb(0, 0, 1, 0.5), border = NA)

plot of chunk unnamed-chunk-9

Alternatively, we might want to show different confidence intervals with this kind of polygon:

plot(y1 ~ x1, col = "gray")
# 67% CI To draw the polygon, we have to specify the x positions of the
# points from our predictions.  We do this first left to right (for the
# lower CI limit) and then right to left (for the upper CI limit).  Then we
# specify the y positions, which are just the outputs from the `predict`
# function.
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))), 
    c(pred1$fit - (qnorm(0.835) * pred1$se.fit), rev(pred1$fit + (qnorm(0.835) * 
        pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# Note: The `qnorm` function tells us how much to multiply our SEs by to get
# Gaussian CIs.  95% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))), 
    c(pred1$fit - (qnorm(0.975) * pred1$se.fit), rev(pred1$fit + (qnorm(0.975) * 
        pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# 99% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))), 
    c(pred1$fit - (qnorm(0.995) * pred1$se.fit), rev(pred1$fit + (qnorm(0.995) * 
        pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)
# 99.9% CI
polygon(c(seq(-3, 3, length.out = 100), rev(seq(-3, 3, length.out = 100))), 
    c(pred1$fit - (qnorm(0.9995) * pred1$se.fit), rev(pred1$fit + (qnorm(0.9995) * 
        pred1$se.fit))), col = rgb(0, 0, 1, 0.2), border = NA)

plot of chunk unnamed-chunk-10

Power, Effect Sizes, and Minimum Detectable Effects

When designing an experiment, we generally want to be able to create an experiment that adequately tests our hypothesis. Accomplishing this requires having sufficient “power” to detect any effects. Power is sometimes also called “sensitivity.” Power refers to the ability of a test (i.e., an analysis of an experiment) to detect a “true effect” that is different from the null hypothesis (e.g., the ability to detect a difference between treatment and control when that difference actually exists).

Factors influencing power

There are four factors that influence power: sample size, the true effect size, the variance of the effect, and the alpha-threshold (level of significance). The most important factor in power is sample size. Larger samples have more power than small samples, but the gain in power is non-linear. There is a declining marginal return (in terms of power) for each additional unit in the experiment, so designing an experiment trades off power with cost-like considerations. The alpha level (level of significance) also influences power. If we have a more liberal threshold (i.e., a higher alpha level), we have more power to detect the effect. But this higher power is due to the fact that the more liberal threshold also increases our “false positive” rate, where the analysis is more likely to say there is an effect when in fact there is not. So, again, there is a trade-off between detecting a true effect and avoiding false detections.

Power of a t-test

One of the simplest examples of power involves looking at a common statistical test for analyzing experiments: the t-test. The t-test looks at the difference in means for two groups or the difference between one group and a null hypothesis value (often zero). R supplies t.test for running the test and power.t.test for the associated power calculations; called with no arguments, as below, both simply return errors telling us which arguments they require:

t.test()
## Error: argument "x" is missing, with no default
power.t.test()
## Error: exactly one of 'n', 'delta', 'sd', 'power', and 'sig.level' must be
## NULL
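
As an illustrative sketch (with made-up inputs, not part of the original output), supplying all but one of the arguments asks power.t.test to solve for the remaining one:

# power of a two-sample t-test with 100 units per group and a true difference of 0.5 SD
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)
# or solve for the per-group n needed to reach 80% power for the same effect
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)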

Minimum detectable effect size

Education researcher Howard Bloom has suggested that power is a difficult concept to grasp. He instead suggests we rely on a measure of “minimum detectable effect” (MDE) to discuss experiments. He's probably right. The MDE tells us the smallest true effect, in standard deviations of the outcome, that is detectable for a given level of power and statistical significance. Because the standard error of the estimated effect is influenced by sample size, the MDE incorporates all of the information of a power calculation but does so in a way that applies to all experiments. That is to say, as long as we can guess at the variance of our outcome, the same sample size considerations apply every time we conduct an experiment.

For a one-tailed test:

sigma <- 1
sigma * qnorm((1 - 0.05)) + sigma * qnorm(0.8)
## [1] 2.486

For a two-tailed test:

sigma * qnorm((1 - (0.5 * 0.05))) + sigma * qnorm(0.8)
## [1] 2.802

We can envision the MDE as the smallest true effect for which, at power = .8, 80% of the sampling distribution of the estimated effect would fall beyond the critical value for statistical significance:

curve(dnorm(x, 0, 1), col = "gray", xlim = c(-3, 8))  # null hypothesis
segments(0, 0, 0, dnorm(0, 0, 1), col = "gray")  # mean
curve(dnorm(x, 4, 1), col = "blue", add = TRUE)  # alternative hypothesis
segments(4, 0, 4, dnorm(4, 4, 1), col = "blue")  # mean

plot of chunk unnamed-chunk-4

calculate power for a one-tailed test and plot:

p <- qnorm((1 - 0.05), 0, 1) + qnorm(0.8, 0, 1)
segments(p, 0, p, dnorm(p, 4, 1), lwd = 2)
## Error: plot.new has not been called yet

Note how the MDE is larger than the smallest effect that would be considered “significant”:

e <- qnorm((1 - 0.05), 0, 1)
segments(e, 0, e, dnorm(e), lwd = 2)
## Error: plot.new has not been called yet

As in standard power calculations, we still need to calculate the standard deviation of the outcome.
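
As a rough sketch with hypothetical numbers (an outcome standard deviation of 3 and 50 units per arm in a simple two-group comparison of means), the multiplier above converts a standard error into an MDE on the outcome's own scale:

sd_y <- 3
n_per_arm <- 50
se_diff <- sqrt(sd_y^2/n_per_arm + sd_y^2/n_per_arm)  # SE of a difference in means
(qnorm(1 - 0.05) + qnorm(0.8)) * se_diff   # one-tailed MDE
(qnorm(1 - 0.025) + qnorm(0.8)) * se_diff  # two-tailed MDE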

Power in cluster randomized experiments

FORTHCOMING

Probability distributions

A critical aspect of (parametric) statistical analysis is the use of probability distributions, like the normal (Gaussian) distribution. These distributions underlie all of our common (parametric) statistical tests, like t-tests, chi-squared tests, ANOVA, regression, and so forth. R has functions to draw values from all of the common distributions (normal, t, F, chi-squared, binomial, poisson, etc.), as well as many others.

There are four families of functions that R implements uniformly across each of these distributions, which provide the probability density, the cumulative density, the quantiles, and random draws from a distribution. For example, the dnorm function provides the density of the normal distribution at a specific quantile. The pnorm function provides the cumulative density of the normal distribution at a specific quantile. The qnorm function provides the quantile of the normal distribution at a specified cumulative density. The fourth function, rnorm, draws random values from the normal distribution (but this is discussed in detail in the random sampling tutorial).

The same functions are also implemented for the other common distributions. For example, the functions for Student's t distribution are dt, pt, and qt. For the chi-squared distribution, they are dchisq, pchisq, and qchisq. Hopefully you see the pattern. The rest of this tutorial walks through how to use these functions.
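
For example (a small illustrative sketch of the naming pattern), the same upper-tail quantile can be requested from several distributions just by switching the prefix and supplying that distribution's parameters:

qnorm(0.975)           # normal
qt(0.975, df = 30)     # t with 30 degrees of freedom
qchisq(0.975, df = 1)  # chi-squared with 1 degree of freedom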

Density functions

The density functions provide the density of a specified distribution at a given quantile. This means that the d* family of functions can extract the density not just from a given distribution but from any version of the distribution. For example, calling:

dnorm(0)
## [1] 0.3989

provides the density of the standard normal distribution (i.e., a normal distribution with mean 0 and standard deviation 1) at the point 0 (i.e., at the distribution's mean). We can retrieve the density at a different value (or vector of values) easily:

dnorm(1)
## [1] 0.242
dnorm(1:3)
## [1] 0.241971 0.053991 0.004432

We can also retrieve densities from a different normal distribution (e.g., one with a higher mean or larger SD):

dnorm(1, mean = -2)
## [1] 0.004432
dnorm(1, mean = 5)
## [1] 0.0001338
dnorm(1, mean = 0, sd = 3)
## [1] 0.1258

Cumulative distribution functions

We are often much more interested in the cumulative distribution (i.e., how much of the distribution is to the left of the indicated value). For this, we can use the p* family of functions. As an example, let's obtain the cumulative distribution function's value from a standard normal distribution at point 0 (i.e., the distribution's mean):

pnorm(0)
## [1] 0.5

Unsurprisingly, the value is .5 because half of the distribution is to the left of 0.

When we conduct statistical significance testing, we compare a value we observe to the cumulative distribution function. As one might recall, the value of 1.65 is the (approximate) critical value for a 90% normal confidence interval. We can see that by requesting:

pnorm(1.65)
## [1] 0.9505

The comparable value for a 95% CI is 1.96:

pnorm(1.96)
## [1] 0.975

Note how the values are ~.95 and ~.975 rather than ~.90 and ~.95; because these are critical values for two-tailed tests, only half of the remaining probability lies above each critical value. If we plug a negative value into the pnorm function, we'll receive the cumulative probability for the left side of the distribution:

pnorm(-1.96)
## [1] 0.025

Thus subtracting the output of pnorm for the negative input from the output for the positive input, we'll see that 95% of the density is between -1.96 and 1.96 (in the standard normal distribution):

pnorm(1.96) - pnorm(-1.96)
## [1] 0.95

Quantile function

The examples just described relied on the heuristic values of 1.65 and 1.96 as the thresholds for 90% and 95% two-tailed tests. But to find the exact points at which the normal distribution has accumulated a particular cumulative density, we can use the qnorm function. Essentially, qnorm is the reverse of pnorm. To obtain the critical values for a two-tailed 95% confidence interval, we would plug .025 and .975 into qnorm:

qnorm(c(0.025, 0.975))
## [1] -1.96  1.96

And we could actually nest that call inside a pnorm function to see that pnorm and qnorm are opposites:

pnorm(qnorm(c(0.025, 0.975)))
## [1] 0.025 0.975

For one-tailed tests, we simply specify the cumulative density. So, for a one-tailed 95% critical value, we would specify:

qnorm(0.95)
## [1] 1.645

We could obtain the other tail by specifying:

qnorm(0.05)
## [1] -1.645

Or, we could request the upper-tail of the distribution rather than the lower (left) tail (which is the default):

qnorm(0.95, lower.tail = FALSE)
## [1] -1.645

As with dnorm, both pnorm and qnorm work on arbitrary normal distributions, but their results will be less familiar to us:

pnorm(1.96, mean = 3)
## [1] 0.1492
qnorm(0.95, mean = 3)
## [1] 4.645

Other distributions

As stated above, R supplies functions analogous to those just described for numerous distributions. Details about all of the distributions can be found in the help files: ? Distributions.

Here are a few examples:

t distribution. Note: The t distribution functions require a df argument, specifying the degrees of freedom.

qt(0.95, df = 1000)
## [1] 1.646
qt(c(0.025, 0.975), df = 1000)
## [1] -1.962  1.962

Binomial distribution. The binomial distribution functions work as above, but require size and prob arguments, specifying the number of draws and the probability of success. So, if we are modelling fair coin flips:

dbinom(0, 1, 0.5)
## [1] 0.5
pbinom(0, 1, 0.5)
## [1] 0.5
qbinom(0.95, 1, 0.5)
## [1] 1
qbinom(c(0.025, 0.975), 1, 0.5)
## [1] 0 1

Note: Because the binomial is a discrete distribution, the values here might seem strange compared to the above.
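
As a slightly richer illustration (a sketch with ten flips instead of one):

dbinom(5, size = 10, prob = 0.5)  # probability of exactly 5 heads in 10 fair flips
pbinom(5, size = 10, prob = 0.5)  # probability of 5 or fewer heads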

R object classes

R objects can be of several different “classes”. A class essentially describes what kind of information is contained in the object.

Numeric

Often an object contains “numeric” class data, like a number or vector of numbers. We can test the class of an object using class:

class(12)
## [1] "numeric"
class(c(1, 1.5, 2))
## [1] "numeric"

While most numbers are of class “numeric”, a subset are “integer”:

class(1:5)
## [1] "integer"

We can coerce numeric class objects to an integer class:

as.integer(c(1, 1.5, 2))
## [1] 1 1 2

But note that this modifies the second item in the vector (1.5 becomes 1).

Character

Other common classes include “character” data. We see character class data in country names or certain survey responses.

class("United States")
## [1] "character"

If we try to coerce character to numeric, we get a warning and the result is a missing value:

as.numeric("United States")
## Warning: NAs introduced by coercion
## [1] NA

If we combine a numeric (or integer) and a character together in a vector, the result is character:

class(c(1, "test"))
## [1] "character"

You can see that the 1 is coerced to character:

c(1, "test")
## [1] "1"    "test"

We can also coerce a numeric vector to character simply by changing its class:

a <- 1:4
class(a)
## [1] "integer"
class(a) <- "character"
class(a)
## [1] "character"
a
## [1] "1" "2" "3" "4"

Factor

Another class is “factor”. Factors are very important to R, especially in regression modelling. Factors combine characteristics of the numeric and character classes. We can create a factor from numeric data using factor:

factor(1:3)
## [1] 1 2 3
## Levels: 1 2 3

We see that the factor displays a special levels attribute. Levels describe the unique values in the vector. For example, with the following factor, there are six values but only two levels:

factor(c(1, 2, 1, 2, 1, 2))
## [1] 1 2 1 2 1 2
## Levels: 1 2

To see just the levels, we can use the levels function:

levels(factor(1:3))
## [1] "1" "2" "3"
levels(factor(c(1, 2, 1, 2, 1, 2)))
## [1] "1" "2"

We can also build factors from character data:

factor(c("a", "b", "b", "c"))
## [1] a b b c
## Levels: a b c

We can look at factors in more detail in the factors.R script.

Logical

Another common class is “logical” data. This class involves TRUE/FALSE values. We look at this class in detail in the logicals.R script.
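
As a quick illustration (a minimal sketch), literal logical values and the results of comparisons both carry this class:

class(TRUE)
class(c(TRUE, FALSE, NA))
class(1 == 2)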

R Objects and Environment

One of the most confusing aspects of R for users of other statistical software is the idea that one can have any number of objects available in the R environment. One need not be constrained to a single rectangular dataset. This also means that it can be confusing to see what data is actually loaded into memory at any point in time. Here we discuss some tools for understanding the R working environment.

Listing objects

Let's start by clearing our workspace:

# rm(list=ls())

This option should be available in RGui under menu Miscellaneous > Remove all objects. Then create some R objects:

set.seed(1)
x <- rbinom(50, 1, 0.5)
y <- ifelse(x == 1, rnorm(sum(x == 1), 1, 1), rnorm(sum(!x == 1), 2, 1))
mydf <- data.frame(x = x, y = y)

Once we have a number of objects stored in memory, we can look at all of them using ls:

ls()
##   [1] "a"             "allout"        "amat"          "b"            
##   [5] "between"       "bmat"          "c"             "change"       
##   [9] "cmat"          "coef.mi"       "coefs.amelia"  "d"            
##  [13] "d2"            "df1"           "df2"           "e"            
##  [17] "e1"            "e2"            "e3"            "e4"           
##  [21] "englebert"     "f"             "FUN"           "g1"           
##  [25] "g2"            "grandm"        "grandse"       "grandvar"     
##  [29] "height"        "imp"           "imp.amelia"    "imp.mi"       
##  [33] "imp.mice"      "lm"            "lm.amelia.out" "lm.mi.out"    
##  [37] "lm.mice.out"   "lmfit"         "lmp"           "localfit"     
##  [41] "localp"        "logodds"       "logodds_lower" "logodds_se"   
##  [45] "logodds_upper" "m"             "m1"            "m2"           
##  [49] "m2a"           "m2b"           "m3a"           "m3b"          
##  [53] "me"            "me_se"         "means"         "mmdemo"       
##  [57] "mydf"          "myformula"     "n"             "newdata"      
##  [61] "newdata1"      "newdata2"      "newdf"         "newvar"       
##  [65] "out"           "p1"            "p2"            "p2a"          
##  [69] "p2b"           "p3a"           "p3b"           "p3b.fitted"   
##  [73] "part1"         "part2"         "pool.mice"     "ppcurve"      
##  [77] "s"             "s.amelia"      "s.mi"          "s.mice"       
##  [81] "s.orig"        "s.real"        "s2"            "search"       
##  [85] "ses"           "ses.amelia"    "tmpdf"         "tmpsplit"     
##  [89] "tr"            "w"             "weight"        "within"       
##  [93] "x"             "X"             "x1"            "x2"           
##  [97] "X2"            "x3"            "y"             "y1"           
## [101] "y1s"           "y2"            "y2s"           "y3"           
## [105] "y3s"           "z"             "z1"            "z2"

This shows us all of the objects that are currently saved. If we do another operation but do not save the result:

2 + 2
## [1] 4
ls()
##   [1] "a"             "allout"        "amat"          "b"            
##   [5] "between"       "bmat"          "c"             "change"       
##   [9] "cmat"          "coef.mi"       "coefs.amelia"  "d"            
##  [13] "d2"            "df1"           "df2"           "e"            
##  [17] "e1"            "e2"            "e3"            "e4"           
##  [21] "englebert"     "f"             "FUN"           "g1"           
##  [25] "g2"            "grandm"        "grandse"       "grandvar"     
##  [29] "height"        "imp"           "imp.amelia"    "imp.mi"       
##  [33] "imp.mice"      "lm"            "lm.amelia.out" "lm.mi.out"    
##  [37] "lm.mice.out"   "lmfit"         "lmp"           "localfit"     
##  [41] "localp"        "logodds"       "logodds_lower" "logodds_se"   
##  [45] "logodds_upper" "m"             "m1"            "m2"           
##  [49] "m2a"           "m2b"           "m3a"           "m3b"          
##  [53] "me"            "me_se"         "means"         "mmdemo"       
##  [57] "mydf"          "myformula"     "n"             "newdata"      
##  [61] "newdata1"      "newdata2"      "newdf"         "newvar"       
##  [65] "out"           "p1"            "p2"            "p2a"          
##  [69] "p2b"           "p3a"           "p3b"           "p3b.fitted"   
##  [73] "part1"         "part2"         "pool.mice"     "ppcurve"      
##  [77] "s"             "s.amelia"      "s.mi"          "s.mice"       
##  [81] "s.orig"        "s.real"        "s2"            "search"       
##  [85] "ses"           "ses.amelia"    "tmpdf"         "tmpsplit"     
##  [89] "tr"            "w"             "weight"        "within"       
##  [93] "x"             "X"             "x1"            "x2"           
##  [97] "X2"            "x3"            "y"             "y1"           
## [101] "y1s"           "y2"            "y2s"           "y3"           
## [105] "y3s"           "z"             "z1"            "z2"

This result is not visible with ls. Essentially, it disappears into the ether.
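
By contrast (a small optional sketch using an arbitrary name, res), assigning the result to a name keeps it around and makes it visible to ls:

res <- 2 + 2      # 'res' is just an illustrative name
"res" %in% ls()   # TRUE once the assignment has been made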

Viewing individual objects

Now we can look at any of these objects just by calling their name:

x
##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1
y
##  [1]  2.3411  0.8706 -0.4708  0.5218  1.6328  2.3587  0.8972  1.3877
##  [9]  0.9462  1.9608  2.6897  2.0280  0.9407  2.1888  1.7632  3.4656
## [17]  0.7466  1.6970  2.4755  0.3112  0.2925  1.0659  1.7685  2.3411
## [25]  0.8706  3.4330  3.9804  1.6328  0.8442  2.5697  1.8649  1.4179
## [33]  1.9608  2.6897  1.3877  0.9462 -0.3771  0.1950  0.6057  2.1533
## [41]  2.1000  1.7632  0.8355  0.7466  1.6970  1.5567  2.3411  0.8706
## [49]  1.3646  1.7685
mydf
##    x       y
## 1  0  2.3411
## 2  0  0.8706
## 3  1 -0.4708
## 4  1  0.5218
## 5  0  1.6328
## 6  1  2.3587
## 7  1  0.8972
## 8  1  1.3877
## 9  1  0.9462
## 10 0  1.9608
## 11 0  2.6897
## 12 0  2.0280
## 13 1  0.9407
## 14 0  2.1888
## 15 1  1.7632
## 16 0  3.4656
## 17 1  0.7466
## 18 1  1.6970
## 19 0  2.4755
## 20 1  0.3112
## 21 1  0.2925
## 22 0  1.0659
## 23 1  1.7685
## 24 0  2.3411
## 25 0  0.8706
## 26 0  3.4330
## 27 0  3.9804
## 28 0  1.6328
## 29 1  0.8442
## 30 0  2.5697
## 31 0  1.8649
## 32 1  1.4179
## 33 0  1.9608
## 34 0  2.6897
## 35 1  1.3877
## 36 1  0.9462
## 37 1 -0.3771
## 38 0  0.1950
## 39 1  0.6057
## 40 0  2.1533
## 41 1  2.1000
## 42 1  1.7632
## 43 1  0.8355
## 44 1  0.7466
## 45 1  1.6970
## 46 1  1.5567
## 47 0  2.3411
## 48 0  0.8706
## 49 1  1.3646
## 50 1  1.7685

The first two objects (x and y) are vectors, so they simply print to the console. The third object (mydf) is a dataframe, so its contents are printed as columns with row numbers. If we call one of the columns from the dataframe, it will look just like a vector:

mydf$x
##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 1
## [36] 1 1 0 1 0 1 1 1 1 1 1 0 0 1 1

This looks the same as just calling the x object and indeed they are the same:

mydf$x == x
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

But if we change one of the objects, it only affects the object we changed:

x <- rbinom(50, 1, 0.5)
mydf$x == x
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [12] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [23]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [34] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [45] FALSE  TRUE  TRUE FALSE FALSE  TRUE

So by storing something new into x we change it, but not mydf$x because that's a different object.

Object class

We sometimes want to know what kind of object something is. We can see this with class:

class(x)
## [1] "integer"
class(y)
## [1] "numeric"
class(mydf)
## [1] "data.frame"

We can also use class on the columns of a dataframe:

class(mydf$x)
## [1] "integer"
class(mydf$y)
## [1] "numeric"

This is helpful, but it doesn't tell us a lot about the objects (i.e., it's not a very good summary). We can, however, see more detail using some other functions.

str

One way to get very detailed information about an object is with str (i.e., structure):

str(x)
##  int [1:50] 1 1 0 0 1 0 1 0 0 0 ...

This output tells us that this is an object of class “integer”, with length 50, and it shows the first few values.

str(y)
##  num [1:50] 2.341 0.871 -0.471 0.522 1.633 ...

This output tells us that this is an object of class “numeric”, with length 50, and it shows the first few values.

str(mydf)
## 'data.frame':    50 obs. of  2 variables:
##  $ x: int  0 0 1 1 0 1 1 1 1 0 ...
##  $ y: num  2.341 0.871 -0.471 0.522 1.633 ...

This output tells us that this is an object of class “data.frame”, with 50 observations on two variables. It then provides the same type of details for each variable that we would see by calling str(mydf$x), etc. directly. Using str on dataframes is therefore a very helpful and compact way to look at your data. More about this later.

summary

To see more details we may want to use some other functions. One particularly helpful function is summary, which provides some basic details about an object. For the two vectors, this will give us summary statistics.

summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    0.56    1.00    1.00
summary(y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.471   0.871   1.630   1.550   2.140   3.980

For the dataframe, it will give us summary statistics for everything in the dataframe:

summary(mydf)
##        x              y         
##  Min.   :0.00   Min.   :-0.471  
##  1st Qu.:0.00   1st Qu.: 0.871  
##  Median :1.00   Median : 1.633  
##  Mean   :0.54   Mean   : 1.549  
##  3rd Qu.:1.00   3rd Qu.: 2.140  
##  Max.   :1.00   Max.   : 3.980

Note how the same type of information is printed but looks different. This is because R prints slightly different things depending on the class of the input object. If you look “under the hood”, you will see that summary is actually a set of multiple functions. When you type summary, you see that R calls a “method” depending on the class of the object. For our examples, the methods called are summary.default and summary.data.frame, which differ in what they print to the console for vectors and dataframes, respectively.
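
If you are curious which summary methods are available on your system (an optional aside), the methods function will list them:

methods(summary)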

Conveniently, we can also save any output of a function as a new object. So here we can save the summary of x as a new object:

sx <- summary(x)

And do the same for mydf:

smydf <- summary(mydf)

We can then see that these new objects also have classes:

class(sx)
## [1] "summaryDefault" "table"
class(mydf)
## [1] "data.frame"

And, as you might be figuring out, an object's class determines how it is printed to the console. Again, looking “under the hood”, this is because there are separate print methods for each object class (see print.data.frame for how a dataframe is printed and print.table for how the summary of a dataframe is printed). This can create some confusion, though, because it means that what is printed is a reflection of the underlying object but is not actually the object. A bit existential, right? Because calling objects shows a printed rendition of an object, we can sometimes get confused about what that object actually is. This is where str can again be helpful:

str(sx)
## Classes 'summaryDefault', 'table'  Named num [1:6] 0 0 1 0.56 1 1
##   ..- attr(*, "names")= chr [1:6] "Min." "1st Qu." "Median" "Mean" ...
str(smydf)
##  'table' chr [1:6, 1:2] "Min.   :0.00  " "1st Qu.:0.00  " ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:6] "" "" "" "" ...
##   ..$ : chr [1:2] "      x" "      y"

Here we see that the summary of x and summary of mydf are both tables. summary(x) is a one-dimensional table, whereas summary(mydf) is a two-dimensional table (because it shows multiple variables). Because these objects are tables, it actually means we can index them like any other table:

sx[1]
## Min. 
##    0
sx[2:3]
## 1st Qu.  Median 
##       0       1
smydf[, 1]
##                                                                     
## "Min.   :0.00  " "1st Qu.:0.00  " "Median :1.00  " "Mean   :0.54  " 
##                                   
## "3rd Qu.:1.00  " "Max.   :1.00  "
smydf[1:3, ]
##        x                y           
##  "Min.   :0.00  " "Min.   :-0.471  "
##  "1st Qu.:0.00  " "1st Qu.: 0.871  "
##  "Median :1.00  " "Median : 1.633  "

This can be confusing because sx and smydf do not look like objects we can index, but that is because the way they are printed doesn't reflect the underlying structure of the objects.

Structure of other objects

It can be helpful to look at another example to see how what is printed can be confusing. Let's conduct a t-test on our data and see the result:

t.test(mydf$x, mydf$y)
## 
##  Welch Two Sample t-test
## 
## data:  mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3067 -0.7109
## sample estimates:
## mean of x mean of y 
##     0.540     1.549

The result is a bunch of details about the t.test. Like above, we can save this object:

myttest <- t.test(mydf$x, mydf$y)

Then we can call the object again whenever we want without repeating the calculation:

myttest
## 
##  Welch Two Sample t-test
## 
## data:  mydf$x and mydf$y
## t = -6.745, df = 75.45, p-value = 2.714e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3067 -0.7109
## sample estimates:
## mean of x mean of y 
##     0.540     1.549

If we try to run summary on this, we get some weirdness:

summary(myttest)
##             Length Class  Mode     
## statistic   1      -none- numeric  
## parameter   1      -none- numeric  
## p.value     1      -none- numeric  
## conf.int    2      -none- numeric  
## estimate    2      -none- numeric  
## null.value  1      -none- numeric  
## alternative 1      -none- character
## method      1      -none- character
## data.name   1      -none- character

This is because there is no summary method defined for a t-test result. Why? It comes down to the class and structure of our myttest object. Let's look:

class(myttest)
## [1] "htest"

This says it is of class “htest”. Not intuitive, but that's what it is.

str(myttest)
## List of 9
##  $ statistic  : Named num -6.75
##   ..- attr(*, "names")= chr "t"
##  $ parameter  : Named num 75.4
##   ..- attr(*, "names")= chr "df"
##  $ p.value    : num 2.71e-09
##  $ conf.int   : atomic [1:2] -1.307 -0.711
##   ..- attr(*, "conf.level")= num 0.95
##  $ estimate   : Named num [1:2] 0.54 1.55
##   ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
##  $ null.value : Named num 0
##   ..- attr(*, "names")= chr "difference in means"
##  $ alternative: chr "two.sided"
##  $ method     : chr "Welch Two Sample t-test"
##  $ data.name  : chr "mydf$x and mydf$y"
##  - attr(*, "class")= chr "htest"

This is more interesting. The output tells us that myttest is a list of 9 objects. If we compare this to the output of myttest, we will see that when we call myttest, R is printing the underlying list in a pretty fashion for us. But because myttest is a list, it means that we can access any of the values in the list simply by calling them. So the list consists of statistic, parameter, p.value, etc. Let's look at some of them:

myttest$statistic
##      t 
## -6.745
myttest$p.value
## [1] 2.714e-09

The ability to extract these values from the underlying object (in addition to seeing them printed to the console in pretty form) means that we can easily reuse objects, e.g., to combine the results of multiple tests into a simplified table or to use values from one test elsewhere in our analysis. As a simple example, let's compare the p-values of the same t-test under different hypotheses (two-sided, which is the default, and each of the one-sided alternatives):

myttest2 <- t.test(mydf$x, mydf$y, "greater")
myttest3 <- t.test(mydf$x, mydf$y, "less")
myttest$p.value
## [1] 2.714e-09
myttest2$p.value
## [1] 1
myttest3$p.value
## [1] 1.357e-09

This is much easier than copying and pasting the p-value from each of the printed outputs, and because these objects are stored in memory, we can access them at any point later in the session.
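
Since each result is just a stored object, collecting the three p-values into a small table is straightforward. A minimal sketch, reusing the three test objects created above:

ptable <- data.frame(alternative = c("two.sided", "greater", "less"),
    p.value = c(myttest$p.value, myttest2$p.value, myttest3$p.value))
ptable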

Recoding

Recoding is one of the most important tasks in preparing for an analysis. Often the data we have are not in the format we need to perform an analysis. Changing data in R is easy, as long as we understand indexing and assignment.

To recode values, we can either rely on positional or logical indexing. To change a particular value, we can rely on positions:

a <- 1:10
a[1] <- 99
a
##  [1] 99  2  3  4  5  6  7  8  9 10

But this does not scale well; it is no better than recoding by hand. Logical indexing makes it much easier to change multiple values at once:

a[a < 5] <- 99
a
##  [1] 99 99 99 99  5  6  7  8  9 10

We can use multiple logical indices to change all of our values. For example, we could bin a vector into groups based on its values:

b <- 1:20
c <- b
c[b < 6] <- 1
c[b >= 6 & b <= 10] <- 2
c[b >= 11 & b <= 15] <- 3
c[b > 15] <- 4

Looking at the two vectors as a matrix, we can see how our input values translated to outputs:

cbind(b, c)
##        b c
##  [1,]  1 1
##  [2,]  2 1
##  [3,]  3 1
##  [4,]  4 1
##  [5,]  5 1
##  [6,]  6 2
##  [7,]  7 2
##  [8,]  8 2
##  [9,]  9 2
## [10,] 10 2
## [11,] 11 3
## [12,] 12 3
## [13,] 13 3
## [14,] 14 3
## [15,] 15 3
## [16,] 16 4
## [17,] 17 4
## [18,] 18 4
## [19,] 19 4
## [20,] 20 4

We can obtain the same result with nested ifelse functions:

d <- ifelse(b < 6, 1, ifelse(b >= 6 & b <= 10, 2, ifelse(b >= 11 & b <= 15, 
    3, ifelse(b > 15, 4, NA))))
cbind(b, c, d)
##        b c d
##  [1,]  1 1 1
##  [2,]  2 1 1
##  [3,]  3 1 1
##  [4,]  4 1 1
##  [5,]  5 1 1
##  [6,]  6 2 2
##  [7,]  7 2 2
##  [8,]  8 2 2
##  [9,]  9 2 2
## [10,] 10 2 2
## [11,] 11 3 3
## [12,] 12 3 3
## [13,] 13 3 3
## [14,] 14 3 3
## [15,] 15 3 3
## [16,] 16 4 4
## [17,] 17 4 4
## [18,] 18 4 4
## [19,] 19 4 4
## [20,] 20 4 4

Another, sometimes more convenient, way of writing this uses the car package, which we need to load:

library(car)

From this package we can use the recode function to recode a vector:

e <- recode(b, "1:5=1; 6:10=2; 11:15=3; 16:20=4; else=NA")

The recode function can also infer the minimum ('lo') and maximum ('hi') values in a vector:

f <- recode(b, "lo:5=1; 6:10=2; 11:15=3; 16:hi=4; else=NA")

All of these techniques produce the same result:

cbind(b, c, d, e, f)
##        b c d e f
##  [1,]  1 1 1 1 1
##  [2,]  2 1 1 1 1
##  [3,]  3 1 1 1 1
##  [4,]  4 1 1 1 1
##  [5,]  5 1 1 1 1
##  [6,]  6 2 2 2 2
##  [7,]  7 2 2 2 2
##  [8,]  8 2 2 2 2
##  [9,]  9 2 2 2 2
## [10,] 10 2 2 2 2
## [11,] 11 3 3 3 3
## [12,] 12 3 3 3 3
## [13,] 13 3 3 3 3
## [14,] 14 3 3 3 3
## [15,] 15 3 3 3 3
## [16,] 16 4 4 4 4
## [17,] 17 4 4 4 4
## [18,] 18 4 4 4 4
## [19,] 19 4 4 4 4
## [20,] 20 4 4 4 4

Instead of checking this visually, we can use the all.equal function to compare two vectors:

all.equal(c, d)
## [1] TRUE
all.equal(c, e)
## [1] TRUE
all.equal(c, f)
## [1] TRUE

Note: if we instead used the == double equals comparator, the result would be a logical vector that compares corresponding values in each vector:

c == d
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE

If all.equal reports any differences, the element-wise == comparison shows exactly where the vectors differ.
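
A related trick (a brief aside): wrapping the comparison in which returns just the positions at which two vectors disagree, which is handy for locating recoding mistakes.

which(c != d)  # positions where c and d differ (none here, so an empty result)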

Recoding missing values

Missing values are handled somewhat differently from other values. If our vector has missing values, we need to use the is.na logical function to identify them:

g <- c(1:5, NA, 7:13, NA, 15)
h <- g
g
##  [1]  1  2  3  4  5 NA  7  8  9 10 11 12 13 NA 15
g[is.na(g)] <- 99

The recode function can also handle missing values to produce the same result:

h <- recode(h, "NA=99")
all.equal(g, h)
## [1] TRUE

Recoding based on multiple input variables

Often we want to recode based on two variables (e.g., age and sex) to produce categories. This is easy using the right logical statements. Let's create some fake data (in the form of a dataframe) using a function called expand.grid:

i <- expand.grid(1:4, 1:2)
i
##   Var1 Var2
## 1    1    1
## 2    2    1
## 3    3    1
## 4    4    1
## 5    1    2
## 6    2    2
## 7    3    2
## 8    4    2

This dataframe has two variables (columns), one with four categories ('Var1') and one with two ('Var2'). Perhaps we want to create a variable that reflects each unique combination of the two variables. We can do this with ifelse:

ifelse(i$Var2 == 1, i$Var1, i$Var1 + 4)
## [1] 1 2 3 4 5 6 7 8

This statement says that if an element of i$Var2 is equal to 1, then the corresponding element of our new variable takes the value of i$Var1; otherwise it is set to i$Var1 + 4.

That solution requires us to know something about the data (that it's okay to simply add 4 to get unique values). A more general solution is to use the interaction function:

interaction(i$Var1, i$Var2)
## [1] 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2
## Levels: 1.1 2.1 3.1 4.1 1.2 2.2 3.2 4.2

This produces a factor vector with eight unique values. The names are a little strange, but it will always give us every unique combination of the two (or more) vectors.
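
If we prefer plain integer group codes to factor labels, one option (a small sketch) is to convert the factor returned by interaction with as.integer:

i$group <- as.integer(interaction(i$Var1, i$Var2))
i$group  # one integer code (1 through 8) per unique combination of Var1 and Var2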

There are lots of different ways to recode vectors, but these are the basic tools that can be combined to do almost anything.

Regression-related plotting

R's graphical capabilities are very strong. This is particularly helpful when dealing with regression.

If we have a regression model, there are a number of ways we can plot the relationships between variables. We can also use plots for checking model specification and assumptions.

Let's start with a basic bivariate regression:

set.seed(1)
x <- rnorm(1000)
y <- 1 + x + rnorm(1000)
ols1 <- lm(y ~ x)

The easiest plot we can draw is the relationship between y and x.

plot(y ~ x)

plot of chunk unnamed-chunk-2

We can also add a line representing this relationship to the plot. To get the coefficients from the model, we can use coef:

coef(ols1)
## (Intercept)           x 
##      0.9838      1.0064

We can then use those coefficients in the line-plotting function abline:

plot(y ~ x)
abline(a = coef(ols1)[1], b = coef(ols1)[2])

plot of chunk unnamed-chunk-4

We can specify “graphical parameters” in both plot and abline to change the look. For example, we could change the color:

plot(y ~ x, col = "gray")
abline(a = coef(ols1)[1], b = coef(ols1)[2], col = "red")

plot of chunk unnamed-chunk-5

We can also use plot to extract several diagnostics for our model. Almost all of these help us to identify outliers or other irregularities. If we type:

plot(ols1)

plot of chunk unnamed-chunk-6 (four diagnostic plots)

We are given a series of plots describing the model. We can also see two other plots that are not displayed by default. To obtain a given plot, we use the which parameter inside plot:

plot(ols1, which = 4)

plot of chunk unnamed-chunk-7

(1) A residual plot
(2) A Quantile-Quantile plot to check the distribution of our residuals
(3) A scale-location plot
(4) Cook's distance, to identify potential outliers
(5) A residual versus leverage plot, to identify potential outliers
(6) Cook's distance versus leverage plot

Besides the default plot(ols1, which=1) to get residuals, we can also plot residuals manually:

plot(ols1$residuals ~ x)

plot of chunk unnamed-chunk-8

We might want to do this to check whether another variable should be in our model:

x2 <- rnorm(1000)
plot(ols1$residuals ~ x2)

plot of chunk unnamed-chunk-9

Obviously, in this case x2 doesn't belong in the model. Let's see a case where the plot would help us:

y2 <- x + x2 + rnorm(1000)
ols2 <- lm(y2 ~ x)
plot(ols2$residuals ~ x2)

plot of chunk unnamed-chunk-10

Clearly, x2 is strongly related to our residuals, so it belongs in the model.

We can also use residual plots to check for nonlinear relationships (i.e., issues of functional form):

y3 <- x + (x^2) + rnorm(1000)
ols3 <- lm(y3 ~ x)
plot(ols3$residuals ~ x)

plot of chunk unnamed-chunk-11

Even though x is in our model, it is not in the correct form. Let's try fixing that and see what happens to our plot:

ols3b <- lm(y3 ~ x + I(x^2))

Note: We need to use the I() operator inside formulae in order to have R generate the x^2 variable! This saves us from having to define a new variable (xsq <- x^2) and then run the model with that new variable.
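
As a quick check (a sketch using a temporary xsq variable), fitting the model with an explicitly created squared term gives the same coefficient as the I(x^2) version:

xsq <- x^2
ols3c <- lm(y3 ~ x + xsq)
all.equal(unname(coef(ols3b)["I(x^2)"]), unname(coef(ols3c)["xsq"]))  # should be TRUE

With that confirmed, here is the residual plot for the corrected model: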

plot(ols3b$residuals ~ x)

plot of chunk unnamed-chunk-13

Clearly, the model now incorporates x in the correct functional form.

Of course, if we had plotted our data originally:

plot(y3 ~ x)

plot of chunk unnamed-chunk-14

We would have seen the non-linear relationship and could have skipped the incorrect model entirely.

Residual plots can also reveal heteroskedasticity:

x3 <- runif(1000, 1, 10)
y4 <- (3 * x3) + rnorm(1000, 0, x3)
ols4 <- lm(y4 ~ x3)
plot(ols4$residuals ~ x3)

plot of chunk unnamed-chunk-15

Here we see that x3 is correctly specified in the model: there is no systematic relationship between x3 and the residuals. But the variance of the residuals is much larger at higher levels of x3. We might therefore need to rely on a different estimate of our regression SEs than the default provided by R (a sketch of one option appears just below). And, again, this is a problem we could have identified by plotting our original data:

plot(y4 ~ x3)

plot of chunk unnamed-chunk-16
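
As an aside on the “different estimate of our regression SEs” mentioned above, one common remedy is to report heteroskedasticity-consistent standard errors. A minimal sketch, assuming the sandwich and lmtest packages are installed:

library(sandwich)
library(lmtest)
# same coefficients as summary(ols4), but with robust (HC1) standard errors
coeftest(ols4, vcov = vcovHC(ols4, type = "HC1"))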

Multivariate OLS plotting

If our model has more than one independent variable, these plotting tools all still work.

set.seed(1)
x5 <- rnorm(1000)
z5 <- runif(1000, 1, 5)
y5 <- x5 + z5 + rnorm(1000)
ols5 <- lm(y5 ~ x5 + z5)

We can see all six of our diagnostic plots:

plot(ols5, 1:6)

plot of chunk unnamed-chunk-18 (six diagnostic plots)

We can plot our outcome against the input variables:

plot(y5 ~ x5)

plot of chunk unnamed-chunk-19

plot(y5 ~ z5)

plot of chunk unnamed-chunk-19

We can see residual plots:

plot(ols5$residuals ~ x5)

plot of chunk unnamed-chunk-20

plot(ols5$residuals ~ z5)

plot of chunk unnamed-chunk-20

We might also want to check for collinearity between our input variables. We could do this with cor:

cor(x5, z5)
## [1] 0.03504

Or we could see it visually with a scatterplot:

plot(x5, z5)

plot of chunk unnamed-chunk-22

In either case, there's no relationship.

We can also plot our effects from our model against our input data:

coef(ols5)
## (Intercept)          x5          z5 
##     -0.0639      1.0183      1.0182

Let's plot the two input variables together using layout:

layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")

plot of chunk unnamed-chunk-24

Note: We shift each intercept by the mean of the other input variable (strictly speaking, by its coefficient times its mean, but the coefficients here are approximately 1) so that each line is drawn at a representative height. If we plot each bivariate relationship separately, we'll see how we get the lines of best fit:

ols5a <- lm(y5 ~ x5)
ols5b <- lm(y5 ~ z5)
layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "red")

plot of chunk unnamed-chunk-25

If we regress the residuals from ols5a on z5 we'll see some magic happen. The estimated coefficient for z5 is almost identical to that from our full y5 ~ x5 + z5 model:

tmpz <- lm(ols5a$residuals ~ z5)
coef(tmpz)["z5"]
##    z5 
## 1.017
coef(ols5)["z5"]
##    z5 
## 1.018

The same pattern works if we repeat this process for our x5 input variable:

tmpx <- lm(ols5b$residuals ~ x5)
coef(tmpx)["x5"]
##    x5 
## 1.017
coef(ols5)["x5"]
##    x5 
## 1.018

In other words, each coefficient in our full model ols5 reflects the regression of the part of y left unexplained by the other input variable(s) on that input variable. Let's see this visually by drawing the bivariate regression lines in blue and then overlaying the full model estimates in red:

layout(matrix(1:2, ncol = 2))
plot(y5 ~ x5, col = "gray")
abline(a = coef(ols5a)[1], b = coef(ols5a)["x5"], col = "blue")
abline(a = coef(ols5)[1] + mean(z5), b = coef(ols5)["x5"], col = "red")
plot(y5 ~ z5, col = "gray")
coef(lm(ols5a$residuals ~ z5))["z5"]
##    z5 
## 1.017
abline(a = coef(ols5b)[1], b = coef(ols5b)["z5"], col = "blue")
abline(a = coef(ols5)[1] + mean(x5), b = coef(ols5)["z5"], col = "red")

plot of chunk unnamed-chunk-28

In that example, x5 and z5 were uncorrelated, so there was no bias from excluding one variable. Let's look at a situation where we find omitted variable bias due to correlation between input variables.

set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 + z6 + rnorm(1000)

We can see from a correlation and a plot that our two input variables are correlated:

cor(x6, z6)
## [1] 0.5565
plot(x6, z6)

plot of chunk unnamed-chunk-30

Let's estimate some models:

ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)

And then let's compare the bivariate estimates (blue) to the multivariate estimates (red):

layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]
##     z6 
## 0.7004
abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")

plot of chunk unnamed-chunk-32

As we can see, the estimates from our bivariate models overestimate the impact of each input. We could of course see this in the raw coefficients, as well:

coef(ols6)
## (Intercept)          x6          z6 
##     0.01624     1.03437     1.01468
coef(ols6a)
## (Intercept)          x6 
##   -0.008398    2.058839
coef(ols6b)
## (Intercept)          z6 
##     0.01563     1.33198

These plots show, however, that omitted variable bias can be dangerous even when it seems our estimates are correct. The blue lines seem to fit the data, but those simple plots (and regressions) fail to account for correlations between inputs.

And the problem is that you can't predict omitted variable bias a priori. Let's repeat that last analysis but simply change the data generating process slightly:

set.seed(1)
x6 <- rnorm(1000)
z6 <- x6 + rnorm(1000, 0, 1.5)
y6 <- x6 - z6 + rnorm(1000)  #' this is the only difference from the previous example
cor(x6, z6)
## [1] 0.5565
ols6 <- lm(y6 ~ x6 + z6)
ols6a <- lm(y6 ~ x6)
ols6b <- lm(y6 ~ z6)
layout(matrix(1:2, ncol = 2))
plot(y6 ~ x6, col = "gray")
abline(a = coef(ols6a)[1], b = coef(ols6a)["x6"], col = "blue")
abline(a = coef(ols6)[1] + mean(z6), b = coef(ols6)["x6"], col = "red")
plot(y6 ~ z6, col = "gray")
coef(lm(ols6a$residuals ~ z6))["z6"]
##      z6 
## -0.6802
abline(a = coef(ols6b)[1], b = coef(ols6b)["z6"], col = "blue")
abline(a = coef(ols6)[1] + mean(x6), b = coef(ols6)["z6"], col = "red")

plot of chunk unnamed-chunk-34

The blue lines seem to fit the data, but they're biased estimates.

Regression coefficient plots

A contemporary way of presenting regression results involves converting a regression table into a figure.

set.seed(500)
x1 <- rnorm(100, 5, 5)
x2 <- rnorm(100, -2, 10)
x3 <- rnorm(100, 0, 20)
y <- (1 * x1) + (-2 * x2) + (3 * x3) + rnorm(100, 0, 20)
ols2 <- lm(y ~ x1 + x2 + x3)

Conventionally, we would present results from this regression as a table:

summary(ols2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.89 -12.52   2.67  11.24  46.85 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.0648     2.6053   -0.02    0.980    
## x1            1.2211     0.3607    3.39    0.001 ** 
## x2           -2.0941     0.1831  -11.44   <2e-16 ***
## x3            3.0086     0.1006   29.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.1 on 96 degrees of freedom
## Multiple R-squared:  0.913,  Adjusted R-squared:  0.91 
## F-statistic:  335 on 3 and 96 DF,  p-value: <2e-16

Or just:

coef(summary(ols2))[, 1:2]
##             Estimate Std. Error
## (Intercept) -0.06483     2.6053
## x1           1.22113     0.3607
## x2          -2.09407     0.1831
## x3           3.00856     0.1006

It might be helpful to see the size and significance of these effects as a figure. To do so, we have to draw the regression slopes as points and the SEs as lines.

slopes <- coef(summary(ols2))[c("x1", "x2", "x3"), 1]  #' slopes
ses <- coef(summary(ols2))[c("x1", "x2", "x3"), 2]  #' SEs

We'll draw the slopes of the three input variables. Note: The interpretation of the following plot depends on the input variables having comparable scales; comparing variables measured on very different scales with this visualization can be misleading!

Plotting Standard Errors

Let's construct a plot that draws 1 and 2 SEs for each coefficient:

We'll start with a blank plot (like a blank canvas):

plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
# We can add a title:
title("Regression Results")
# We'll add a y-axis labelling our variables:
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
# We'll add a vertical line for zero:
abline(v = 0, col = "gray")
# Then we'll draw our slopes as points (`pch` tells us what type of point):
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# Then we'll add thick line segments for each 1 SE:
segments((slopes - ses)[1], 1, (slopes + ses)[1], 1, col = "black", lwd = 2)
segments((slopes - ses)[2], 2, (slopes + ses)[2], 2, col = "black", lwd = 2)
segments((slopes - ses)[3], 3, (slopes + ses)[3], 3, col = "black", lwd = 2)
# Then we'll add thin line segments for the 2 SEs:
segments((slopes - (2 * ses))[1], 1, (slopes + (2 * ses))[1], 1, col = "black", 
    lwd = 1)
segments((slopes - (2 * ses))[2], 2, (slopes + (2 * ses))[2], 2, col = "black", 
    lwd = 1)
segments((slopes - (2 * ses))[3], 3, (slopes + (2 * ses))[3], 3, col = "black", 
    lwd = 1)

plot of chunk unnamed-chunk-5

Plotting Confidence Intervals

We can draw a similar plot with confidence intervals instead of SEs.

plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# Then we'll add thick line segments for each 67% CI: Note: The `qnorm`
# function tells us how much to multiply our SEs by to get Gaussian CIs.
# Note: We'll also use vectorization here to save having to retype the
# `segments` command for each line:
segments((slopes - (qnorm(0.835) * ses)), 1:3, (slopes + (qnorm(0.835) * ses)), 
    1:3, col = "black", lwd = 3)
# Then we'll add medium line segments for the 95%:
segments((slopes - (qnorm(0.975) * ses)), 1:3, (slopes + (qnorm(0.975) * ses)), 
    1:3, col = "black", lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((slopes - (qnorm(0.995) * ses)), 1:3, (slopes + (qnorm(0.995) * ses)), 
    1:3, col = "black", lwd = 1)

plot of chunk unnamed-chunk-6

Both of these plots are similar, but show how the size, relative size, and significance of regression slopes can easily be summarized visually.

Note: We can also extract confidence intervals for our model terms directly using the confint function applied to our model object and then plot those CIs using segments:

ci67 <- confint(ols2, c("x1", "x2", "x3"), level = 0.67)
ci95 <- confint(ols2, c("x1", "x2", "x3"), level = 0.95)
ci99 <- confint(ols2, c("x1", "x2", "x3"), level = 0.99)

Now draw the plot:

plot(NA, xlim = c(-3, 3), ylim = c(0, 4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
points(slopes, 1:3, pch = 23, col = "black", bg = "black")
# add the confidence intervals:
segments(ci67[, 1], 1:3, ci67[, 2], 1:3, col = "black", lwd = 3)
segments(ci95[, 1], 1:3, ci95[, 2], 1:3, col = "black", lwd = 2)
segments(ci99[, 1], 1:3, ci99[, 2], 1:3, col = "black", lwd = 1)

plot of chunk unnamed-chunk-8

Comparable effect sizes

One of the major problems (noted above) with these kinds of plots is that in order for them to make visual sense, the underlying covariates have to be inherently comparable. By showing slopes, the plot shows the effect of a unit change in each covariate on the outcome, but unit changes may not be comparable across variables. We could probably come up with an infinite number of ways of presenting the results, but let's focus on two here: plotting standard deviation changes in covariates and plotting minimum to maximum changes in scale of covariates.

Standard deviation changes in X

Let's recall the values of our coefficients on x1, x2, and x3:

coef(summary(ols2))[, 1:2]
##             Estimate Std. Error
## (Intercept) -0.06483     2.6053
## x1           1.22113     0.3607
## x2          -2.09407     0.1831
## x3           3.00856     0.1006

On face value, x3 has the largest effect, but what happens when we account for different standard deviations of the covariates:

sd(x1)
## [1] 5.311
sd(x2)
## [1] 10.48
sd(x3)
## [1] 19.07

x3 clearly also has the largest standard deviation, so it may make more sense to compare a standard deviation change across the variables. Doing so is relatively simple because we're working with a linear model: we simply calculate the standard deviation of each covariate and multiply it by the respective coefficient:

c1 <- coef(summary(ols2))[-1, 1:2]  # drop the intercept
c2 <- numeric(length = 3)
c2[1] <- c1[1, 1] * sd(x1)
c2[2] <- c1[2, 1] * sd(x2)
c2[3] <- c1[3, 1] * sd(x3)

Then we'll get standard errors for those changes:

s2 <- numeric(length = 3)
s2[1] <- c1[1, 2] * sd(x1)
s2[2] <- c1[2, 2] * sd(x2)
s2[3] <- c1[3, 2] * sd(x3)
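
The same quantities can be computed in a single vectorized step by multiplying the estimate and SE columns of c1 by a vector of standard deviations (a small sketch; c2_alt and s2_alt are just alternative names for illustration):

sds <- c(sd(x1), sd(x2), sd(x3))
c2_alt <- c1[, 1] * sds  # identical to c2
s2_alt <- c1[, 2] * sds  # identical to s2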

Then we can plot the results:

plot(c2, 1:3, pch = 23, col = "black", bg = "black", xlim = c(-25, 65), ylim = c(0, 
    4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
# Then we'll add medium line segments for the 95%:
segments((c2 - (qnorm(0.975) * s2)), 1:3, (c2 + (qnorm(0.975) * s2)), 1:3, col = "black", 
    lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((c2 - (qnorm(0.995) * s2)), 1:3, (c2 + (qnorm(0.995) * s2)), 1:3, col = "black", 
    lwd = 1)

plot of chunk unnamed-chunk-13

By looking at standard deviation changes (focus on the scale of the x-axis), we can see that x3 actually has the largest effect by a much larger factor than we saw in the raw slopes. Moving the same relative amount up each covariate's distribution produces substantially different effects on the outcome.

Full scale changes in X

Another way to visualize effect sizes is to examine the effect of full scale changes in covariates. This is especially useful when dealing with covariates that differ dramatically in scale (e.g., a mix of discrete and continuous variables). The basic calculations for these kinds of plots are the same as in the previous plot, but instead of using sd, we use diff(range()), which tells us what a full scale change is in the units of each covariate:

c3 <- numeric(length = 3)
c3[1] <- c1[1, 1] * diff(range(x1))
c3[2] <- c1[2, 1] * diff(range(x2))
c3[3] <- c1[3, 1] * diff(range(x3))

Then we'll get standard errors for those changes:

s3 <- numeric(length = 3)
s3[1] <- c1[1, 2] * diff(range(x1))
s3[2] <- c1[2, 2] * diff(range(x2))
s3[3] <- c1[3, 2] * diff(range(x3))

Then we can plot the results:

plot(c3, 1:3, pch = 23, col = "black", bg = "black", xlim = c(-150, 300), ylim = c(0, 
    4), xlab = "Slope", ylab = "", yaxt = "n")
title("Regression Results")
axis(2, 1:3, c("x1", "x2", "x3"), las = 2)
abline(v = 0, col = "gray")
# Then we'll add medium line segments for the 95%:
segments((c3 - (qnorm(0.975) * s3)), 1:3, (c3 + (qnorm(0.975) * s3)), 1:3, col = "black", 
    lwd = 2)
# Then we'll add thin line segments for the 99%:
segments((c3 - (qnorm(0.995) * s3)), 1:3, (c3 + (qnorm(0.995) * s3)), 1:3, col = "black", 
    lwd = 1)

plot of chunk unnamed-chunk-16

Focusing on the x-axes of the last three plots, we see how differences in scaling of the covariates can lead to vastly different visual interpretations of effect sizes. Plotting the slopes directly suggested that x3 had an effect about three times larger than the effect of x1. Plotting standard deviation changes suggested that x3 had an effect about 10 times larger than the effect of x1 and plotting full scale changes in covariates showed a similar substantive conclusion. While each showed that x3 had the largest effect, interpreting the relative contribution of the different variables depends upon how much variance we would typically see in each variable in our data. The unit-change effect (represented by the slope) may not be the effect size that we ultimately care about for each covariate.

Regular expressions

# as.character() toupper() tolower()

# match() pmatch()

# ? regex ; Regular Expressions (Wikipedia): http://en.wikipedia.org/wiki/Regular_expression

# grep() grepl()

# regexpr() gregexpr() regexec()

# agrep()


# regmatches()

# sub() gsub()
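
The notes above are only a list of the relevant functions. As a minimal, hedged sketch of a few of the most common ones (the example strings are made up for illustration):

s <- c("apple pie", "apple cake", "banana bread")
grepl("apple", s)  # logical: does each element contain "apple"?
grep("apple", s)   # integer positions of the matching elements
sub("a", "A", s)   # replace only the first "a" in each element
gsub("a", "A", s)  # replace every "a" in each element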

Saving R Data

We frequently need to save our data after we have worked on it for some time (e.g., because we've created, rescaled, or deleted variables; created a subset of our original data; modified the data in a time- or processor-intensive way; or simply need to share a subset of the data). In most statistical packages, this is done automatically: those packages open a file and “destructively” make changes to the original file. This can be convenient, but it is also problematic. If I change a file and don't save the original, my work is no longer reproducible from the original file. It essentially builds a step into the scientific workflow that is not explicitly recorded.

R does things differently. When opening a data file in R, the data are read into memory and the link between those data in memory and the original file is severed. Changes made to the data are kept only in R and are lost if R is closed without the data being saved. This is usually fine because good workflow involves writing scripts that work from the original data, make any necessary changes, and then produce output. But, for the reasons stated above, we might want to save our working data for use later on. R provides at least four ways to do this.

Note: All of the methods below overwrite the target file by default. This means that writing over an existing file is “destructive,” so it's a good idea to check that your intended filename isn't already in use, e.g., using list.files(). By default, the file is written to your working directory (getwd()) but can be written elsewhere if you supply a file path rather than just a name.
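
For example, a quick way to check whether a filename is already in use before writing to it (a small sketch using a hypothetical filename):

file.exists("saveddf.RData")       # TRUE if the file already exists
"saveddf.RData" %in% list.files()  # the same check via list.files()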

All of these methods work with an R dataframe, so we'll create a simple one just for the sake of demonstration:

set.seed(1)
mydf <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

save (and load)

The most flexible way to save data objects from R uses the save function. By default, save writes an R object (or multiple R objects) to an R-readable binary file that can be opened using load. Because save can store multiple objects (including one's entire current workspace), it provides a very flexible way to “pick up where you left off.” For example, using save.image('myworkspace.RData'), you could save everything about your current R workspace, and then load('myworkspace.RData') later and be exactly where you were before. But it is also a convenient way to write data to a file that you plan to use again in R. Because it saves R objects “as-is,” there's no need to worry about problems reading in the data or needing to change structure or variable names, because the file is saved (and will load) exactly as it looks in R. The dataframe will even have the same name (i.e., in our example, the loaded object will be called mydf). The .RData file format is also very space-efficient, taking up less room than a comparable comma-separated values file containing the same data. To write our dataframe using save, we simply supply the name of the dataframe and the destination file:

save(mydf, file = "saveddf.RData")
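
To see the round trip in action (a quick sketch), we can remove mydf from the workspace and restore it from the file; load brings it back under its original name:

rm(mydf)
load("saveddf.RData")
exists("mydf")  # should be TRUE again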

Note that the file name is not important (so long as it does not overwrite another file in your working directory). If you load the file using load, the R object mydf will appear in your workspace. Let's remove the file just to not leave a mess:

unlink("saveddf.RData")

dput (and dget)

Sometimes we want to be able to write our data in a way that makes it exactly reproducible (like save), but we also want to be able to read the file. Because save creates a binary file, we can only open the file in R (or another piece of software that reads .RData files). If we want, for example, to be able to look at or change the file in a text editor, we need it in another format. One R-specific solution for this is dput. The dput function saves data as an R expression. This means that the resulting file can actually be copied and pasted into the R console. This is especially helpful if you want to share (part of) your data with someone else. Indeed, when you ask data-related questions on StackOverflow, you are generally expected to supply your data using dput to make it easy for people to help you. We can also simply write the output of dput to the console to see what it looks like. Let's try that before writing it to a file:

dput(mydf)
## structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047, 
## 1.59528080213779, 0.329507771815361, -0.820468384118015, 0.487429052428485, 
## 0.738324705129217, 0.575781351653492, -0.305388387156356, 1.51178116845085, 
## 0.389843236411431, -0.621240580541804, -2.2146998871775, 1.12493091814311, 
## -0.0449336090152309, -0.0161902630989461, 0.943836210685299, 
## 0.821221195098089, 0.593901321217509, 0.918977371608218, 0.782136300731067, 
## 0.0745649833651906, -1.98935169586337, 0.61982574789471, -0.0561287395290008, 
## -0.155795506705329, -1.47075238389927, -0.47815005510862, 0.417941560199702, 
## 1.35867955152904, -0.102787727342996, 0.387671611559369, -0.0538050405829051, 
## -1.37705955682861, -0.41499456329968, -0.394289953710349, -0.0593133967111857, 
## 1.10002537198388, 0.763175748457544, -0.164523596253587, -0.253361680136508, 
## 0.696963375404737, 0.556663198673657, -0.68875569454952, -0.70749515696212, 
## 0.36458196213683, 0.768532924515416, -0.112346212150228, 0.881107726454215, 
## 0.398105880367068, -0.612026393250771, 0.341119691424425, -1.12936309608079, 
## 1.43302370170104, 1.98039989850586, -0.367221476466509, -1.04413462631653, 
## 0.569719627442413, -0.135054603880824, 2.40161776050478, -0.0392400027331692, 
## 0.689739362450777, 0.0280021587806661, -0.743273208882405, 0.188792299514343, 
## -1.80495862889104, 1.46555486156289, 0.153253338211898, 2.17261167036215, 
## 0.475509528899663, -0.709946430921815, 0.610726353489055, -0.934097631644252, 
## -1.2536334002391, 0.291446235517463, -0.443291873218433, 0.00110535163162413, 
## 0.0743413241516641, -0.589520946188072, -0.568668732818502, -0.135178615123832, 
## 1.1780869965732, -1.52356680042976, 0.593946187628422, 0.332950371213518, 
## 1.06309983727636, -0.304183923634301, 0.370018809916288, 0.267098790772231, 
## -0.54252003099165, 1.20786780598317, 1.16040261569495, 0.700213649514998, 
## 1.58683345454085, 0.558486425565304, -1.27659220845804, -0.573265414236886, 
## -1.22461261489836, -0.473400636439312), y = c(-0.620366677224124, 
## 0.0421158731442352, -0.910921648552446, 0.158028772404075, -0.654584643918818, 
## 1.76728726937265, 0.716707476017206, 0.910174229495227, 0.384185357826345, 
## 1.68217608051942, -0.635736453948977, -0.461644730360566, 1.43228223854166, 
## -0.650696353310367, -0.207380743601965, -0.392807929441984, -0.319992868548507, 
## -0.279113302976559, 0.494188331267827, -0.177330482269606, -0.505957462114257, 
## 1.34303882517041, -0.214579408546869, -0.179556530043387, -0.100190741213562, 
## 0.712666307051405, -0.0735644041263263, -0.0376341714670479, 
## -0.681660478755657, -0.324270272246319, 0.0601604404345152, -0.588894486259664, 
## 0.531496192632572, -1.51839408178679, 0.306557860789766, -1.53644982353759, 
## -0.300976126836611, -0.528279904445006, -0.652094780680999, -0.0568967778473925, 
## -1.91435942568001, 1.17658331201856, -1.664972436212, -0.463530401472386, 
## -1.11592010504285, -0.750819001193448, 2.08716654562835, 0.0173956196932517, 
## -1.28630053043433, -1.64060553441858, 0.450187101272656, -0.018559832714638, 
## -0.318068374543844, -0.929362147453702, -1.48746031014148, -1.07519229661568, 
## 1.00002880371391, -0.621266694796823, -1.38442684738449, 1.86929062242358, 
## 0.425100377372448, -0.238647100913033, 1.05848304870902, 0.886422651374936, 
## -0.619243048231147, 2.20610246454047, -0.255027030141015, -1.42449465021281, 
## -0.144399601954219, 0.207538339232345, 2.30797839905936, 0.105802367893711, 
## 0.456998805423414, -0.077152935356531, -0.334000842366544, -0.0347260283112762, 
## 0.787639605630162, 2.07524500865228, 1.02739243876377, 1.2079083983867, 
## -1.23132342155804, 0.983895570053379, 0.219924803660651, -1.46725002909224, 
## 0.521022742648139, -0.158754604716016, 1.4645873119698, -0.766081999604665, 
## -0.430211753928547, -0.926109497377437, -0.17710396143654, 0.402011779486338, 
## -0.731748173119606, 0.830373167981674, -1.20808278630446, -1.04798441280774, 
## 1.44115770684428, -1.01584746530465, 0.411974712317515, -0.38107605110892
## ), z = c(0.409401839650934, 1.68887328620405, 1.58658843344197, 
## -0.330907800682766, -2.28523553529247, 2.49766158983416, 0.667066166765493, 
## 0.5413273359637, -0.0133995231459087, 0.510108422952926, -0.164375831769667, 
## 0.420694643254513, -0.400246743977644, -1.37020787754746, 0.987838267454879, 
## 1.51974502549955, -0.308740569225614, -1.25328975560769, 0.642241305677824, 
## -0.0447091368939791, -1.73321840682484, 0.00213185968026965, 
## -0.630300333928146, -0.340968579860405, -1.15657236263585, 1.80314190791747, 
## -0.331132036391221, -1.60551341225308, 0.197193438739481, 0.263175646405474, 
## -0.985826700409291, -2.88892067167955, -0.640481702565115, 0.570507635920485, 
## -0.05972327604261, -0.0981787440052344, 0.560820728620116, -1.18645863857947, 
## 1.09677704427424, -0.00534402827816569, 0.707310667398079, 1.03410773473746, 
## 0.223480414915304, -0.878707612866019, 1.16296455596733, -2.00016494478548, 
## -0.544790740001725, -0.255670709156989, -0.166121036765006, 1.02046390878411, 
## 0.136221893102778, 0.407167603423836, -0.0696548130129049, -0.247664341619331, 
## 0.69555080661964, 1.1462283572158, -2.40309621489187, 0.572739555245841, 
## 0.374724406778655, -0.425267721556076, 0.951012807576816, -0.389237181718379, 
## -0.284330661799574, 0.857409778079803, 1.7196272991206, 0.270054900937229, 
## -0.42218400978764, -1.18911329485959, -0.33103297887901, -0.939829326510021, 
## -0.258932583118785, 0.394379168221572, -0.851857092023863, 2.64916688109488, 
## 0.156011675665079, 1.13020726745494, -2.28912397984011, 0.741001157195439, 
## -1.31624516045156, 0.919803677609141, 0.398130155451956, -0.407528579269772, 
## 1.32425863017727, -0.70123166924692, -0.580614304240536, -1.00107218102542, 
## -0.668178606753393, 0.945184953373082, 0.433702149545162, 1.00515921767704, 
## -0.390118664053679, 0.376370291774648, 0.244164924486494, -1.42625734238254, 
## 1.77842928747545, 0.134447660933676, 0.765598999157864, 0.955136676908982, 
## -0.0505657014422701, -0.305815419766971)), .Names = c("x", "y", 
## "z"), row.names = c(NA, -100L), class = "data.frame")

As you can see, the output is a complicated R expression (using the structure function), which includes all of the data values, the variable names, the row names, and the class of the object. If you were to copy and paste this output into a new R session (and assign it to a name), you would have the exact same dataframe as the one we created here. We can write this to a file (with any extension) by specifying a file argument:

dput(mydf, "saveddf.txt")

I would tend to use the .txt (text file) extension, so that the file can be easily opened in any text editor, but you can use any extension. Note: Unlike save and load, which store an R object and then restore it under the same name, dput does not store the name of the R object. So, if we want to load the dataframe again (using dget), we need to assign the result to a variable:

mydf2 <- dget("saveddf.txt")

Additionally, and again unlike save, dput only stores values up to a finite level of precision. So while our original mydf and the read-back-in dataframe mydf2 look very similar, they differ slightly due to the rules of floating-point representation (a low-level detail of computing that is rarely worth worrying about):

head(mydf)
##         x        y       z
## 1 -0.6265 -0.62037  0.4094
## 2  0.1836  0.04212  1.6889
## 3 -0.8356 -0.91092  1.5866
## 4  1.5953  0.15803 -0.3309
## 5  0.3295 -0.65458 -2.2852
## 6 -0.8205  1.76729  2.4977
head(mydf2)
##         x        y       z
## 1 -0.6265 -0.62037  0.4094
## 2  0.1836  0.04212  1.6889
## 3 -0.8356 -0.91092  1.5866
## 4  1.5953  0.15803 -0.3309
## 5  0.3295 -0.65458 -2.2852
## 6 -0.8205  1.76729  2.4977
mydf == mydf2
##            x     y     z
##   [1,] FALSE FALSE FALSE
##   [2,] FALSE FALSE FALSE
##   [3,] FALSE FALSE FALSE
##   [4,] FALSE FALSE FALSE
##   [5,] FALSE FALSE FALSE
##   [6,] FALSE FALSE FALSE
##   [7,] FALSE FALSE  TRUE
##   [8,] FALSE FALSE FALSE
##   [9,] FALSE FALSE FALSE
##  [10,]  TRUE FALSE FALSE
##  [11,] FALSE  TRUE FALSE
##  [12,] FALSE  TRUE FALSE
##  [13,] FALSE FALSE FALSE
##  [14,]  TRUE FALSE FALSE
##  [15,] FALSE FALSE FALSE
##  [16,] FALSE FALSE FALSE
##  [17,] FALSE FALSE FALSE
##  [18,] FALSE FALSE FALSE
##  [19,] FALSE  TRUE FALSE
##  [20,] FALSE FALSE FALSE
##  [21,] FALSE FALSE FALSE
##  [22,] FALSE FALSE FALSE
##  [23,]  TRUE FALSE FALSE
##  [24,] FALSE FALSE FALSE
##  [25,] FALSE FALSE FALSE
##  [26,] FALSE FALSE FALSE
##  [27,] FALSE FALSE FALSE
##  [28,] FALSE FALSE FALSE
##  [29,] FALSE FALSE FALSE
##  [30,] FALSE FALSE FALSE
##  [31,] FALSE FALSE FALSE
##  [32,] FALSE FALSE FALSE
##  [33,] FALSE FALSE FALSE
##  [34,] FALSE FALSE FALSE
##  [35,] FALSE FALSE FALSE
##  [36,] FALSE FALSE FALSE
##  [37,] FALSE FALSE FALSE
##  [38,] FALSE FALSE FALSE
##  [39,] FALSE FALSE FALSE
##  [40,] FALSE FALSE FALSE
##  [41,] FALSE FALSE FALSE
##  [42,] FALSE FALSE FALSE
##  [43,] FALSE FALSE FALSE
##  [44,] FALSE FALSE FALSE
##  [45,] FALSE FALSE FALSE
##  [46,] FALSE FALSE FALSE
##  [47,] FALSE FALSE FALSE
##  [48,] FALSE FALSE FALSE
##  [49,] FALSE FALSE FALSE
##  [50,] FALSE FALSE FALSE
##  [51,] FALSE FALSE FALSE
##  [52,] FALSE FALSE FALSE
##  [53,] FALSE FALSE FALSE
##  [54,] FALSE FALSE FALSE
##  [55,] FALSE FALSE FALSE
##  [56,]  TRUE FALSE FALSE
##  [57,] FALSE FALSE FALSE
##  [58,] FALSE FALSE FALSE
##  [59,] FALSE FALSE FALSE
##  [60,] FALSE FALSE FALSE
##  [61,] FALSE FALSE FALSE
##  [62,] FALSE FALSE FALSE
##  [63,] FALSE  TRUE FALSE
##  [64,] FALSE FALSE FALSE
##  [65,] FALSE FALSE FALSE
##  [66,] FALSE FALSE FALSE
##  [67,] FALSE FALSE FALSE
##  [68,] FALSE FALSE FALSE
##  [69,] FALSE FALSE FALSE
##  [70,] FALSE FALSE FALSE
##  [71,] FALSE FALSE FALSE
##  [72,] FALSE FALSE FALSE
##  [73,] FALSE FALSE FALSE
##  [74,] FALSE FALSE  TRUE
##  [75,] FALSE FALSE FALSE
##  [76,] FALSE FALSE FALSE
##  [77,]  TRUE FALSE FALSE
##  [78,] FALSE FALSE FALSE
##  [79,] FALSE FALSE FALSE
##  [80,] FALSE FALSE FALSE
##  [81,] FALSE FALSE FALSE
##  [82,] FALSE FALSE FALSE
##  [83,] FALSE FALSE FALSE
##  [84,] FALSE FALSE FALSE
##  [85,] FALSE FALSE FALSE
##  [86,] FALSE  TRUE FALSE
##  [87,] FALSE FALSE FALSE
##  [88,] FALSE FALSE FALSE
##  [89,] FALSE FALSE FALSE
##  [90,] FALSE FALSE FALSE
##  [91,] FALSE FALSE FALSE
##  [92,] FALSE FALSE FALSE
##  [93,] FALSE FALSE FALSE
##  [94,] FALSE FALSE FALSE
##  [95,] FALSE FALSE FALSE
##  [96,] FALSE FALSE FALSE
##  [97,] FALSE FALSE FALSE
##  [98,] FALSE FALSE FALSE
##  [99,] FALSE FALSE FALSE
## [100,] FALSE FALSE  TRUE

Thus, a dataframe saved using save is exactly the same when reloaded into R, whereas one saved using dput is the same only up to a very high, but finite, level of precision.
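
A brief aside: because all.equal compares numeric values within a small tolerance (unlike the exact == comparison above), it should treat the two dataframes as equal:

all.equal(mydf, mydf2)  # expected to return TRUE despite the tiny differences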

Let's clean up that file so as not to leave a mess:

unlink("saveddf.text")

dump (and source)

Similar to dput, the dump function writes the dput output to a file. Indeed, it writes the exact same representation we saw above on the console. But, instead of writing an R expression that we have to assign to a variable later, dump preserves the name of our dataframe. Thus it is a blend between dput and save (but mostly it is like dput). dump also uses a default filename, "dumpdata.R", making it a shorter command to write and one that is less likely to be destructive (except to previous data dumps). Let's see how it works:

dump("mydf")

Note: We specify the dataframe name as a character string because this is written to the file so that when we load the "dumpdata.R" file, the dataframe has the same name as it does right now. We can load this dataframe into memory from the file using source:

source("dumpdata.R", echo = TRUE)
## 
## > mydf <-
## + structure(list(x = c(-0.626453810742332, 0.183643324222082, -0.835628612410047, 
## + 1.59528080213779, 0.329507771815361, -0.820468384118015 .... [TRUNCATED]

As you'll see in the (truncated) output of source, the file looks just like the dput output but includes mydf <- at the beginning, meaning that it stores the dput-like output in the mydf object in R's memory. Note: dump can also take an arbitrary file name via its file argument (like save and dput).

Let's clean up that file so as not to leave a mess:

unlink("dumpdata.R")

write.csv and write.table

One of the easiest ways to save an R dataframe is to write it to a comma-separated value (CSV) file. CSV files are human-readable (e.g., in a text editor) and can be opened by essentially any statistical software (Excel, Stata, SPSS, SAS, etc.) making them one of the best formats for data sharing. To save a dataframe as CSV is easy. You simply need to use the write.csv function with the name of the dataframe and the name of the file you want to write to. Let's see how it works:

write.csv(mydf, file = "saveddf.csv")

That's all there is to it. R also allows you to save files in other CSV-like formats. For example, sometimes we want to save data using a different separator such as a tab (i.e., to create a tab-separated value file or TSV). The TSV is, for example, the default file format used by The Dataverse Network online data repository. To write to a TSV we use a related function write.table and specify the sep argument:

write.table(mydf, file = "saveddf.tsv", sep = "\t")

Note: We use the \t symbol to represent a tab (a standard common to many programming languages). We could also specify any character as a separator, such as | or ; or . but commas and tabs are the most common. Note: Just like dput, writing to a CSV or another delimited-format file necessarily includes some loss of precision, which may or may not be problematic for your particular use case.
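
One small caveat worth knowing (a hedged aside): write.csv includes row names as an extra, unnamed column by default, which shows up as a column called X when the file is read back in. Setting row.names = FALSE avoids this:

write.csv(mydf, file = "saveddf.csv", row.names = FALSE)
mydf3 <- read.csv("saveddf.csv")  # reads back with just the x, y, and z columns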

Let's clean up our files just so we don't leave a mess:

unlink("savedf.csv")
unlink("savedf.tsv")

Writing to “foreign” file formats

The foreign package, which we can use to load “foreign” file formats, also includes a write.foreign function that can be used to write an R dataframe to another package's proprietary data format. Supported formats include SPSS, Stata, and SAS.
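
A minimal sketch of what that looks like (assuming the foreign package is installed; the file names here are arbitrary). write.dta writes a Stata .dta file directly, while write.foreign writes a plain-text data file plus a syntax file that the other package uses to read it:

library(foreign)
write.dta(mydf, "saveddf.dta")
write.foreign(mydf, datafile = "saveddf.dat", codefile = "saveddf.sps", package = "SPSS")
unlink(c("saveddf.dta", "saveddf.dat", "saveddf.sps"))  # clean up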

Scale construction

One of the most common analytic tasks is creating variables. For example, we have some variable that we need to use in the analysis, but we want it to have a mean of zero or be confined to [0,1]. Alternatively, we might have a large number of indicators that we need to aggregate into a single variable. When we used R as a calculator, we learned that R is “vectorized”. This means that when we call a function like add (+), it adds each respective element of two vectors together. For example:

(1:3) + (10:12)
## [1] 11 13 15

This returns a three-element vector that added each corresponding element of the two vectors together. We also should remember R's tendency to use “recycling”:

(1:3) + 10
## [1] 11 12 13

Here, the second vector only has one element, so R assumes that you want to add 10 to each element of the first vector (as opposed to adding 10 to the first element and nothing to the second and third elements). This is really helpful for preparing data vectors because it means we can use mathematical operators (addition, subtraction, multiplication, division, powers, logs, etc.) for their intuitive purposes when trying to create new variables rather than having to rely on obscure function names. But R also has a number of other functions for building variables.

Let's examine all of these features using some made-up data. In this case, we'll create a dataframe of indicator variables (coded 0 and 1) and build them into various scales.

set.seed(1)
n <- 30
mydf <- data.frame(x1 = rbinom(n, 1, 0.5), x2 = rbinom(n, 1, 0.1), x3 = rbinom(n, 
    1, 0.5), x4 = rbinom(n, 1, 0.8), x5 = 1, x6 = sample(c(0, 1, NA), n, TRUE))

Let's use str and summary to get a quick sense of the data:

str(mydf)
## 'data.frame':    30 obs. of  6 variables:
##  $ x1: int  0 0 1 1 0 1 1 1 1 0 ...
##  $ x2: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ x3: int  1 0 0 0 1 0 0 1 0 1 ...
##  $ x4: int  1 1 1 0 1 1 1 1 0 1 ...
##  $ x5: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ x6: num  NA 1 1 0 NA 1 1 0 0 1 ...
summary(mydf)
##        x1              x2          x3            x4              x5   
##  Min.   :0.000   Min.   :0   Min.   :0.0   Min.   :0.000   Min.   :1  
##  1st Qu.:0.000   1st Qu.:0   1st Qu.:0.0   1st Qu.:1.000   1st Qu.:1  
##  Median :0.000   Median :0   Median :0.0   Median :1.000   Median :1  
##  Mean   :0.467   Mean   :0   Mean   :0.4   Mean   :0.833   Mean   :1  
##  3rd Qu.:1.000   3rd Qu.:0   3rd Qu.:1.0   3rd Qu.:1.000   3rd Qu.:1  
##  Max.   :1.000   Max.   :0   Max.   :1.0   Max.   :1.000   Max.   :1  
##                                                                       
##        x6       
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :1.000  
##  Mean   :0.591  
##  3rd Qu.:1.000  
##  Max.   :1.000  
##  NA's   :8

All variables are coded 0 or 1, x5 is all 1's, and x6 contains some missing data (NA) values.

Simple scaling

The easiest scales are those that add or subtract variables. Let's try that quickly:

mydf$x1 + mydf$x2
##  [1] 0 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 0 1 0 0 0 0 0 1 0
mydf$x1 + mydf$x2 + mydf$x3
##  [1] 1 0 1 1 1 1 1 2 1 1 0 1 1 0 1 1 2 1 1 2 1 1 1 0 1 0 1 0 1 0
mydf$x1 + mydf$x2 - mydf$x3
##  [1] -1  0  1  1 -1  1  1  0  1 -1  0 -1  1  0  1 -1  0  1 -1  0  1 -1  1
## [24]  0 -1  0 -1  0  1  0

One way to save some typing is to use the with command, which simply tells R which dataframe to look in for variables:

with(mydf, x1 + x2 - x3)
##  [1] -1  0  1  1 -1  1  1  0  1 -1  0 -1  1  0  1 -1  0  1 -1  0  1 -1  1
## [24]  0 -1  0 -1  0  1  0

A faster way to take a rowsum is to use rowSums:

rowSums(mydf)
##  [1] NA  3  4  2 NA  4  4  4  2  4  3  3  3  2 NA  4  5  4 NA  5 NA  4  3
## [24]  2 NA  3  3 NA  3 NA

Because we have missing data, any row that has an NA results in a sum of NA. We could either skip that column:

rowSums(mydf[, 1:5])
##  [1] 3 2 3 2 3 3 3 4 2 3 2 3 3 1 3 3 4 3 2 4 2 3 3 2 3 2 3 2 3 2

or use the na.rm=TRUE argument to skip NA values when calculating the sum:

rowSums(mydf, na.rm = TRUE)
##  [1] 3 3 4 2 3 4 4 4 2 4 3 3 3 2 3 4 5 4 2 5 2 4 3 2 3 3 3 2 3 2

or we could look at a reduced dataset, eliminating all rows from the result that have a missing value:

rowSums(na.omit(mydf))
##  2  3  4  6  7  8  9 10 11 12 13 14 16 17 18 20 22 23 24 26 27 29 
##  3  4  2  4  4  4  2  4  3  3  3  2  4  5  4  5  4  3  2  3  3  3

but this last option can create problems if we try to store the result back into our original data (since it has fewer elements than the original dataframe has rows).
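
If we do want to use only complete rows but still keep a result of the full length, one option (a small sketch) is to index both sides with complete.cases so the lengths line up:

cc <- complete.cases(mydf)            # TRUE for rows with no missing values
sums_cc <- rep(NA_real_, nrow(mydf))  # an all-NA vector of the right length
sums_cc[cc] <- rowSums(mydf[cc, ])    # fill in sums only for the complete rows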

We can also multiply (or divide) across variables. For these indicator variables, that applies an AND logic to tell us if all of the variables are 1:

with(mydf, x3 * x4 * x5)
##  [1] 1 0 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0
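
By the same logic (a brief aside), an OR-type indicator for whether any of several variables equals 1 can be built by combining them with | and converting the logical result back to numeric:

with(mydf, as.numeric(x1 | x3))  # 1 if either x1 or x3 (or both) is 1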

We might also want to take an average value across all the columns, which we could do by hand:

with(mydf, x1 + x2 + x3 + x4 + x5 + x6)/6
##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA

or use the rowSums function from earlier:

rowSums(mydf)/6
##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA

or use the even simpler rowMeans function:

rowMeans(mydf)
##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA

If we want to calculate some other kind of function, like the variance, we can use the apply function:

apply(mydf, 1, var)  # the `1` refers to rows
##  [1]     NA 0.3000 0.2667 0.2667     NA 0.2667 0.2667 0.2667 0.2667 0.2667
## [11] 0.3000 0.3000 0.3000 0.2667     NA 0.2667 0.1667 0.2667     NA 0.1667
## [21]     NA 0.2667 0.3000 0.2667     NA 0.3000 0.3000     NA 0.3000     NA

We can also make calculations for columns (though this is less common in rectangular data unless we're trying to create summary statistics). Here are the row-wise summaries again for comparison, followed by column-wise calculations using apply (with margin 2) and sapply:

rowSums(mydf)
##  [1] NA  3  4  2 NA  4  4  4  2  4  3  3  3  2 NA  4  5  4 NA  5 NA  4  3
## [24]  2 NA  3  3 NA  3 NA
rowMeans(mydf)
##  [1]     NA 0.5000 0.6667 0.3333     NA 0.6667 0.6667 0.6667 0.3333 0.6667
## [11] 0.5000 0.5000 0.5000 0.3333     NA 0.6667 0.8333 0.6667     NA 0.8333
## [21]     NA 0.6667 0.5000 0.3333     NA 0.5000 0.5000     NA 0.5000     NA
apply(mydf, 2, var)  # the `2` refers to columns
##     x1     x2     x3     x4     x5     x6 
## 0.2575 0.0000 0.2483 0.1437 0.0000     NA
sapply(mydf, var)  # another way to apply a function to columns
##     x1     x2     x3     x4     x5     x6 
## 0.2575 0.0000 0.2483 0.1437 0.0000     NA

Using indexing in building scales

Sometimes we need to build a scale with a different formula for subsets of a dataset. For example, we want to calculate a scale in one way for men and a different way for women (or something like that). We can use indexing to achieve this. We can start by creating an empty variable with the right number of elements (i.e., the number of rows in our dataframe):

newvar <- numeric(nrow(mydf))

Then we can store values into this conditional on a variable from our dataframe:

newvar[mydf$x1 == 1] <- with(mydf[mydf$x1 == 1, ], x2 + x3)
newvar[mydf$x1 == 0] <- with(mydf[mydf$x1 == 0, ], x3 + x4 + x5)

The key to making that work is using the same index on the new variable as on the original data. Doing otherwise would produce a warning about mismatched lengths:

newvar[mydf$x1 == 1] <- with(mydf, x2 + x3)
## Warning: number of items to replace is not a multiple of replacement
## length
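
An alternative that avoids explicit indexing altogether (a small sketch) is to let ifelse choose the formula row by row:

# x2 + x3 where x1 is 1, otherwise x3 + x4 + x5
newvar2 <- with(mydf, ifelse(x1 == 1, x2 + x3, x3 + x4 + x5))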

Scatterplot Jittering

Scatterplots are one of the best ways to understand a bivariate relationship. They neatly show the form of the relationship between x and y. But they are really only effective when both variables are continuous. When one of the variables is discrete, boxplots, conditional density plots, and other visualization techniques often do a better job communicating relationships. But sometimes we have discrete data that are almost continuous (e.g., years of formal education). These kinds of variables might be nearly continuous and have approximately linear relationships with other variables. Summarizing a continuous outcome (e.g., income) using a boxplot at every level of education can be pretty tedious and indeed produces a difficult graph to read. In these situations, we might want to rely on a scatterplot, but we need to preprocess the data in order to visualize it clearly.

Let's start with some example data (where the predictor variable is discrete and the outcome is continuous), look at the problems with plotting these kinds of data using R's defaults, and then look at the jitter function to draw a better scatterplot.

set.seed(1)
x <- sample(1:10, 200, TRUE)
y <- 3 * x + rnorm(200, 0, 5)

Here's what a standard scatterplot of these data looks like:

plot(y ~ x, pch = 15)

plot of chunk unnamed-chunk-2

Because the independent variable is only observed at a few levels, it can be difficult to get a sense of the “cloud” of points. We can use jitter to add a little random noise to the data in order to see the cloud more clearly:

plot(y ~ jitter(x, 1), pch = 15)

plot of chunk unnamed-chunk-3

We can add even more random noise to see an even more “cloud”-like representation:

plot(y ~ jitter(x, 2), pch = 15)

plot of chunk unnamed-chunk-4

If both our independent and dependent variables are discrete, the value of jitter is even greater. Let's look at some data like this:

x2 <- sample(1:10, 500, TRUE)
y2 <- sample(1:5, 500, TRUE)
plot(y2 ~ x2, pch = 15)

plot of chunk unnamed-chunk-5

Here the data simply look like a grid of points. It is impossible to infer the density of the data anywhere in the plot. jitter will be quite useful. Let's start by applying jitter just to the x2 variable (as we did above):

plot(y2 ~ jitter(x2), pch = 15)

plot of chunk unnamed-chunk-6

Here we start to see the data a little more clearly. Let's try it just on the outcome:

plot(jitter(y2) ~ x2, pch = 15)

plot of chunk unnamed-chunk-7

That's a similar level of improvement, but let's use jitter on both the outcome and predictor to get a much more cloud-like effect:

plot(jitter(y2) ~ jitter(x2), pch = 15)

plot of chunk unnamed-chunk-8

Adding even more noise will make an even fuller cloud:

plot(jitter(y2, 2) ~ jitter(x2, 2), pch = 15)

plot of chunk unnamed-chunk-9

We now clearly see that our data are evenly dense across the entire grid. Of course, adding this kind of noise isn't appropriate for analyzing the data, but we could, e.g., run a regression model on the original data and then use the jittered inputs when plotting the results in order to more clearly convey the underlying descriptive relationship.
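
A quick sketch of that workflow (the fit uses the original values; only the plotted points are jittered):

fit <- lm(y2 ~ x2)  # model estimated on the unjittered data
plot(jitter(y2, 2) ~ jitter(x2, 2), pch = 15, col = "gray")
abline(fit, col = "red", lwd = 2)  # regression line from the real fit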

Scatterplot with marginal rugs

When we want to compare the distributions of two variables in a scatterplot, sometimes it is hard to see the marginal distributions. To observe the marginal distributions more clearly, we can add “rugs” using the rug function. A rug is a compact, one-dimensional display of a variable's distribution (a short tick mark for each observation) drawn along an axis of the plot.

Let's start with some data for two groups.

set.seed(1)
x1 <- rnorm(1000)
x2 <- rbinom(1000, 1, 0.7)
y <- x1 + 5 * x2 + 3 * (x1 * x2) + rnorm(1000, 0, 3)

We can plot the scatterplot for each group separately in red and blue. We can then add some marginal “rugs” to each side. We could do this for all the data or separately for each group. To do it separately for each group, we need to specify the line parameter so that the rugs don't overwrite each other.

plot(x1[x2 == 1], y[x2 == 1], col = "tomato3", xaxt = "n", yaxt = "n", xlab = "", 
    ylab = "", bty = "n")
points(y[x2 == 0] ~ x1[x2 == 0], col = "royalblue3")
# x-axis rugs for each group
rug(x1[x2 == 1], side = 1, line = 0, col = "tomato1", ticksize = 0.01)
rug(x1[x2 == 0], side = 1, line = 0.5, col = "royalblue1", ticksize = 0.01)
# y-axis rugs for each group
rug(y[x2 == 1], side = 2, line = 0, col = "tomato1", ticksize = 0.01)
rug(y[x2 == 0], side = 2, line = 0.5, col = "royalblue1", ticksize = 0.01)
# Note: rug's `ticksize` argument controls how tall the rug ticks are. A
# shorter rug uses less ink to communicate the same information.
axis(1, line = 1)
axis(2, line = 1)

The last two lines add some axes a little farther out than they normally would be on the plots.
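
As noted above, we could instead draw a single rug for all of the data, ignoring group membership. A minimal sketch of that simpler version, using the same x1 and y:

plot(x1, y, col = "gray")
rug(x1, side = 1)  # marginal distribution of x1 on the x-axis
rug(y, side = 2)   # marginal distribution of y on the y-axis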

We might also want to add some more descriptive statistics to the plot: for example, the marginal mean for each group, drawn as a short black tick on each rug:

plot(x1[x2 == 1], y[x2 == 1], col = "tomato3", xaxt = "n", yaxt = "n", xlab = "", 
    ylab = "", bty = "n")
points(y[x2 == 0] ~ x1[x2 == 0], col = "royalblue3")
rug(x1[x2 == 1], side = 1, line = 0, col = "tomato1", ticksize = 0.01)
rug(x1[x2 == 0], side = 1, line = 0.5, col = "royalblue1", ticksize = 0.01)
rug(y[x2 == 1], side = 2, line = 0, col = "tomato1", ticksize = 0.01)
rug(y[x2 == 0], side = 2, line = 0.5, col = "royalblue1", ticksize = 0.01)
axis(1, line = 1)
axis(2, line = 1)
# means(on x-axis rugs)
Axis(at = mean(x1[x2 == 1]), side = 1, line = 0, labels = "", col = "black", 
    lwd.ticks = 3, tck = 0.01)
Axis(at = mean(x1[x2 == 0]), side = 1, line = 0.5, labels = "", col = "black", 
    lwd.ticks = 3, tck = 0.01)
# means(on y-axis rugs)
Axis(at = mean(y[x2 == 1]), side = 2, line = 0, labels = "", col = "black", 
    lwd.ticks = 3, tck = 0.01)
Axis(at = mean(y[x2 == 0]), side = 2, line = 0.5, labels = "", col = "black", 
    lwd.ticks = 3, tck = 0.01)

As should be clear, the means of x1 are similar in both groups, but the means of y in each group differ considerably. By combining the scatterplot with the rug, we are able to communicate considerable information with little ink.

Standardized linear regression coefficients

Sometimes people standardize regression coefficients in order to make them comparable. Gary King argues that this produces apples-to-oranges comparisons, and he's right: there are few contexts in which standardized coefficients are genuinely helpful. Still, it is worth knowing how they are calculated.

Let's start with some data:

set.seed(1)
n <- 1000
x1 <- rnorm(n, -1, 10)
x2 <- rnorm(n, 3, 2)
y <- 5 * x1 + x2 + rnorm(n, 1, 2)

Then we can build and summarize a standard linear regression model.

model1 <- lm(y ~ x1 + x2)

The summary shows us unstandardized coefficients that we typically deal with:

summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.230 -1.313 -0.045  1.363  5.626 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9762     0.1139    8.57   <2e-16 ***
## x1            5.0098     0.0063  795.00   <2e-16 ***
## x2            1.0220     0.0314   32.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.06 on 997 degrees of freedom
## Multiple R-squared:  0.998,  Adjusted R-squared:  0.998 
## F-statistic: 3.17e+05 on 2 and 997 DF,  p-value: <2e-16

We might want standardized coefficients in order to make comparisons across the two input variables, which have different means and variances. To do this, we multiply the coefficients by the standard deviation of the input over the standard deviation of the output.

b <- summary(model1)$coef[2:3, 1]
sy <- apply(model1$model[1], 2, sd)
sx <- apply(model1$model[2:3], 2, sd)
betas <- b * (sx/sy)

The result is a pair of coefficients for x1 and x2 that we can interpret as: “the change in y (in standard deviations) for every one-standard-deviation change in x.”

betas
##      x1      x2 
## 0.99811 0.04092

We can obtain the same results by standardizing our variables to begin with:

yt <- (y - mean(y))/sd(y)
x1t <- (x1 - mean(x1))/sd(x1)
x2t <- (x2 - mean(x2))/sd(x2)
model2 <- lm(yt ~ x1t + x2t)

If we compare the results of the original model to the results from our manual calculation and our pre-standardized model, we see that the latter two sets of coefficients are identical but different from the first.

rbind(model1$coef, model2$coef, c(NA, betas))
##      (Intercept)     x1      x2
## [1,]   9.762e-01 5.0098 1.02202
## [2,]   2.864e-17 0.9981 0.04092
## [3,]          NA 0.9981 0.04092
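
As an aside, base R's scale function centers and scales a variable in one step, so the pre-standardized model could also be fit more compactly. A minimal sketch, using the illustrative name model3:

# scale() subtracts the mean and divides by the standard deviation by default
model3 <- lm(scale(y) ~ scale(x1) + scale(x2))
coef(model3)  # the slope coefficients match the betas computed above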

We can see how the two produce the same inference by examining the change in y predicted by a one-standard-deviation change in x1 from model1:

sd(x1) * model1$coef["x1"]
##    x1 
## 51.85

Dividing that value by the standard deviation of y, we obtain our standardized regression coefficient:

sd(x1) * model1$coef["x1"]/sd(y)
##     x1 
## 0.9981

And the same is true for x2:

sd(x2) * model1$coef["x2"]/sd(y)
##      x2 
## 0.04092

Thus, we obtain the same substantive inference whether or not we standardize the coefficients; using them is simply a matter of which presentation tells the most intuitive story about the data.

Tables

We often want to tabulate data (e.g., categorical data). R supplies tabulation functionality with the table function:

set.seed(1)
a <- sample(1:5, 25, TRUE)
a
##  [1] 2 2 3 5 2 5 5 4 4 1 2 1 4 2 4 3 4 5 2 4 5 2 4 1 2
table(a)
## a
## 1 2 3 4 5 
## 3 8 2 7 5

The result is a table showing each observed value and a frequency count for each. The output looks similar regardless of the class of the vector. Note: If the vector contains continuous data, the result may be unexpected, because nearly every value is unique:

table(rnorm(100))
## 
##   -2.22390027400994     -1.563782051071   -1.43758624082998 
##                   1                   1                   1 
##    -1.4250983947325   -1.28459935387219   -1.28074943178832 
##                   1                   1                   1 
##   -1.26361438497058   -1.23753842192996   -1.16657054708471 
##                   1                   1                   1 
##   -1.13038577760069    -1.0655905803883   -0.97228683550556 
##                   1                   1                   1 
##  -0.940649162618608  -0.912068366948338  -0.891921127284569 
##                   1                   1                   1 
##  -0.880871723252545  -0.873262111744435  -0.832043296117832 
##                   1                   1                   1 
##  -0.814968708869917  -0.797089525071965  -0.795339117255372 
##                   1                   1                   1 
##  -0.776776621764597   -0.69095383969683  -0.649471646796233 
##                   1                   1                   1 
##  -0.649010077708898  -0.615989907707918  -0.542888255010254 
##                   1                   1                   1 
##  -0.500696596002705  -0.452783972553158  -0.433310317456782 
##                   1                   1                   1 
##  -0.429513109491881  -0.424810283377287  -0.418980099421959 
##                   1                   1                   1 
##  -0.412519887482398  -0.411510832795067  -0.376702718583628 
##                   1                   1                   1 
##  -0.299215117897316  -0.289461573688223  -0.282173877322451 
##                   1                   1                   1 
##  -0.279346281854269  -0.275778029088027  -0.235706556439501 
##                   1                   1                   1 
##  -0.227328691424755  -0.224267885278309   -0.21951562675344 
##                   1                   1                   1 
##  -0.172623502645857  -0.119168762418038  -0.117753598165951 
##                   1                   1                   1 
##  -0.115825322156954 -0.0571067743838088 -0.0548774737115786 
##                   1                   1                   1 
## -0.0110454784656636 0.00837095999603331  0.0191563916602738 
##                   1                   1                   1 
##  0.0253828675878054  0.0465803028049967   0.046726172188352 
##                   1                   1                   1 
##  0.0652881816716207   0.119717641289537   0.133336360814841 
##                   1                   1                   1 
##    0.14377148075807   0.229019590694692   0.242263480859686 
##                   1                   1                   1 
##   0.248412648872596   0.250141322854153   0.252223448156132 
##                   1                   1                   1 
##   0.257338377155533   0.266137361672105   0.358728895971352 
##                   1                   1                   1 
##    0.36594112304922   0.377395645981701   0.435683299355719 
##                   1                   1                   1 
##   0.503607972233726   0.560746090888056   0.576718781896486 
##                   1                   1                   1 
##    0.59625901661066   0.618243293566247   0.646674390495345 
##                   1                   1                   1 
##    0.66413569989411   0.726750747385451    0.77214218580453 
##                   1                   1                   1 
##   0.781859184600258   0.804189509744908    0.83204712857239 
##                   1                   1                   1 
##   0.992160365445798   0.996543928544126   0.996986860909106 
##                   1                   1                   1 
##    1.08576936214569    1.10096910219409     1.1519117540872 
##                   1                   1                   1 
##    1.15653699715018    1.23830410085338    1.25408310644997 
##                   1                   1                   1 
##     1.2560188173061    1.29931230256343    1.45598840106634 
##                   1                   1                   1 
##    1.62544730346494    1.67829720781629    1.75790308981071 
##                   1                   1                   1 
##    2.44136462889459 
##                   1

We also often want to obtain percentages (i.e., the proportion of observations falling into each category). We can obtain this information by wrapping our table function in a prop.table function:

prop.table(table(a))
## a
##    1    2    3    4    5 
## 0.12 0.32 0.08 0.28 0.20

The result is a “proportion” table, showing the proportion of observations in each category. If we want percentages, we can simply multiply the resulting table by 100:

prop.table(table(a)) * 100
## a
##  1  2  3  4  5 
## 12 32  8 28 20

To get frequencies and proportions (or percentages) together, we can bind the two tables:

cbind(table(a), prop.table(table(a)))
##   [,1] [,2]
## 1    3 0.12
## 2    8 0.32
## 3    2 0.08
## 4    7 0.28
## 5    5 0.20
rbind(table(a), prop.table(table(a)))
##         1    2    3    4   5
## [1,] 3.00 8.00 2.00 7.00 5.0
## [2,] 0.12 0.32 0.08 0.28 0.2
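
To make the combined table easier to read, we can also name the columns when binding, for example:

cbind(Freq = table(a), Prop = prop.table(table(a)))
##   Freq Prop
## 1    3 0.12
## 2    8 0.32
## 3    2 0.08
## 4    7 0.28
## 5    5 0.20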

In addition to these basic (univariate) tabulation functions, we can also tabulate in two or more dimensions. To obtain simple crosstabulations, we can still use table:

b <- rep(c(1, 2), length = 25)
table(a, b)
##    b
## a   1 2
##   1 0 3
##   2 5 3
##   3 1 1
##   4 5 2
##   5 2 3

The result is a crosstable with the first requested variable a as rows and the second as columns. With more than two variables, the table is harder to read:

c <- rep(c(3, 4, 5), length = 25)
table(a, b, c)
## , , c = 3
## 
##    b
## a   1 2
##   1 0 1
##   2 3 1
##   3 0 1
##   4 1 0
##   5 1 1
## 
## , , c = 4
## 
##    b
## a   1 2
##   1 0 0
##   2 2 2
##   3 0 0
##   4 2 2
##   5 0 0
## 
## , , c = 5
## 
##    b
## a   1 2
##   1 0 2
##   2 0 0
##   3 1 0
##   4 2 0
##   5 1 2

R supplies two additional functions that make reading these kinds of tables easier. The ftable function attempts to collapse the previous result into a more readable format:

ftable(a, b, c)
##     c 3 4 5
## a b        
## 1 1   0 0 0
##   2   1 0 2
## 2 1   3 2 0
##   2   1 2 0
## 3 1   0 0 1
##   2   1 0 0
## 4 1   1 2 2
##   2   0 2 0
## 5 1   1 0 1
##   2   1 0 2

The xtabs function provides an alternative way of requesting tabulations. This uses R's formula data structure (see 'formulas.r'). A righthand-only formula produces the same result as table:

xtabs(~a + b)
##    b
## a   1 2
##   1 0 3
##   2 5 3
##   3 1 1
##   4 5 2
##   5 2 3
xtabs(~a + b + c)
## , , c = 3
## 
##    b
## a   1 2
##   1 0 1
##   2 3 1
##   3 0 1
##   4 1 0
##   5 1 1
## 
## , , c = 4
## 
##    b
## a   1 2
##   1 0 0
##   2 2 2
##   3 0 0
##   4 2 2
##   5 0 0
## 
## , , c = 5
## 
##    b
## a   1 2
##   1 0 2
##   2 0 0
##   3 1 0
##   4 2 0
##   5 1 2

Table margins

With a crosstable, we can also add table margins using addmargins:

x <- table(a, b)
addmargins(x)
##      b
## a      1  2 Sum
##   1    0  3   3
##   2    5  3   8
##   3    1  1   2
##   4    5  2   7
##   5    2  3   5
##   Sum 13 12  25
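
If we only want the marginal totals themselves, the margin.table function returns the totals for a given dimension:

margin.table(x, 1)  # totals for each level of a
## a
## 1 2 3 4 5 
## 3 8 2 7 5
margin.table(x, 2)  # totals for each level of b
## b
##  1  2 
## 13 12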

Proportions in crosstables

As with a one-dimensional table, we can calculate proportions from a k-dimensional table:

prop.table(table(a, b))
##    b
## a      1    2
##   1 0.00 0.12
##   2 0.20 0.12
##   3 0.04 0.04
##   4 0.20 0.08
##   5 0.08 0.12

By default, the proportions are computed over the entire table (all cells sum to 1). We can calculate row proportions (each row sums to 1) by setting the margin parameter to 1:

prop.table(table(a, b), 1)
##    b
## a        1      2
##   1 0.0000 1.0000
##   2 0.6250 0.3750
##   3 0.5000 0.5000
##   4 0.7143 0.2857
##   5 0.4000 0.6000

We can calculate column proportions (each column sums to 1) by setting the margin parameter to 2:

prop.table(table(a, b), 2)
##    b
## a         1       2
##   1 0.00000 0.25000
##   2 0.38462 0.25000
##   3 0.07692 0.08333
##   4 0.38462 0.16667
##   5 0.15385 0.25000
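
With this many decimal places the proportions can be hard to read; one option (a small sketch) is to wrap the call in round:

round(prop.table(table(a, b), 1), 2)  # row proportions, rounded to 2 digits
round(prop.table(table(a, b), 2), 2)  # column proportions, rounded to 2 digits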

The curve Function

One of the many handy, and perhaps underappreciated, functions in R is curve. It is a neat little function for mathematical plotting, i.e., for drawing the graph of a function. This tutorial shows some of its basic functionality.

The curve function takes, as its first argument, an R expression. That expression should be a mathematical function in terms of x. For example, if we wanted to plot the line y=x, we would simply type:

curve((x))

Note: We have to type (x) rather than just x, because curve expects its first argument to be the name of a function or an expression (a call) involving x; wrapping the bare x in parentheses turns it into such an expression.

We can also specify an add parameter to indicate whether to draw the curve on a new plotting device or add to a previous plot. For example, if we wanted to overlay the function y=x^2 on top of y=x we could type:

curve((x))
curve(x^2, add = TRUE)

We aren't restricted to using curve by itself either. We could plot some data and then use curve to draw a y=x line on top of it:

set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)
plot(y ~ x)
curve((x), add = TRUE)

And, like all other plotting functions, curve accepts graphical parameters. So we could redraw our previous graph with gray points and a thick red curve:

plot(y ~ x, col = "gray", pch = 15)
curve((x), add = TRUE, col = "red", lwd = 2)

We could also call these in the opposite order (replacing plot with points):

curve((x), col = "red", lwd = 2)
points(y ~ x, col = "gray", pch = 15)

Note: The two plots differ because, when curve is called first without xlim and ylim, R sets up the plotting region based only on the curve and doesn't know that we're going to add data outside that region when we call points.

We can also use curve (as we would line or points) to draw points rather than a line:

curve(x^2, type = "p")

We can also specify to and from arguments to determine over what range the curve will be drawn. These are independent of xlim and ylim. So we could draw a curve over a small range on a much larger plotting region:

curve(x^3, from = -2, to = 2, xlim = c(-5, 5), ylim = c(-9, 9))

Because curve accepts any R expression as its first argument (as long as that expression resolves to a mathematical function of x), we can overlay all kinds of different curves:

curve((x), from = -2, to = 2, lwd = 2)
curve(0 * x, add = TRUE, col = "blue")
curve(0 * x + 1.5, add = TRUE, col = "green")
curve(x^3, add = TRUE, col = "red")
curve(-3 * (x + 2), add = TRUE, col = "orange")

These are some relatively basic examples, but they highlight the utility of curve: when we simply want to plot a function, it is much easier than generating data vectors that correspond to the function purely for the purposes of plotting.
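
curve works just as well with built-in mathematical functions. For instance, a small sketch overlaying two normal density curves:

curve(dnorm(x), from = -3, to = 3, lwd = 2)   # standard normal density
curve(dnorm(x, sd = 2), add = TRUE, lty = 2)  # wider normal density, dashed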

Variables

Working with objects in R will become tedious if we don't give those objects names to refer to them in subsequent analysis. In R, we can “assign” an object a name that we can then reference subsequently. For example, rather than see the result of the expression 2+2, we can store the result of this expression and look at it later:

a <- 2 + 2

To see the value of the result, we simply call our variable's name:

a
## [1] 4

Thus <- (the less-than and minus symbols together) means “assign the right-hand side to the name on the left-hand side.” We can get the same result using = (an equals sign):

a = 2 + 2
a
## [1] 4

We can also, much less commonly, produce the same result by reversing the order of the statement and using the rightward assignment operator ->:

2 + 2 -> a
a
## [1] 4

This is very uncommon, though. The <- is the preferred assignment operator. When we assign an expression to a variable name, the result of the evaluated expression is saved. Thus, when we call a again later, we don't see 2+2 but instead see 4. We can overwrite the value stored in a variable by simply assigning something new to that variable:

a <- 2 + 2
a <- 3
a
## [1] 3

We can also copy a variable into a different name:

b <- a
b
## [1] 3

We may decide we don't need a variable any more; we can remove it from the R environment using rm:

rm(a)

Sometimes we forget what we've done and want to see what variables we have floating around in our R environment. We can see them with ls:

ls()
##   [1] "a1"            "a2"            "allout"        "amat"         
##   [5] "b"             "b1"            "betas"         "between"      
##   [9] "bin"           "bmat"          "bootcoefs"     "c"            
##  [13] "c1"            "c2"            "c3"            "change"       
##  [17] "ci67"          "ci95"          "ci99"          "cmat"         
##  [21] "coef.mi"       "coefs.amelia"  "condmeans_x"   "condmeans_x2" 
##  [25] "condmeans_y"   "cumprobs"      "d"             "d1"           
##  [29] "d2"            "d3"            "d4"            "d5"           
##  [33] "df1"           "df2"           "dist"          "e"            
##  [37] "e1"            "e2"            "e3"            "e4"           
##  [41] "e5"            "englebert"     "f"             "fit1"         
##  [45] "fit2"          "fit3"          "FUN"           "g"            
##  [49] "g1"            "g2"            "grandm"        "grandse"      
##  [53] "grandvar"      "h"             "height"        "i"            
##  [57] "imp"           "imp.amelia"    "imp.mi"        "imp.mice"     
##  [61] "lm"            "lm.amelia.out" "lm.mi.out"     "lm.mice.out"  
##  [65] "lm1"           "lm2"           "lmfit"         "lmp"          
##  [69] "localfit"      "localp"        "logodds"       "logodds_lower"
##  [73] "logodds_se"    "logodds_upper" "m"             "m1"           
##  [77] "m2"            "m2a"           "m2b"           "m3a"          
##  [81] "m3b"           "me"            "me_se"         "means"        
##  [85] "mmdemo"        "model1"        "model2"        "myboot"       
##  [89] "mydf"          "mydf2"         "myformula"     "myttest"      
##  [93] "myttest2"      "myttest3"      "n"             "n1"           
##  [97] "n2"            "n3"            "new1"          "newdata"      
## [101] "newdata1"      "newdata2"      "newdf"         "newvar"       
## [105] "nx"            "ologit"        "ols"           "ols1"         
## [109] "ols2"          "ols3"          "ols3b"         "ols4"         
## [113] "ols5"          "ols5a"         "ols5b"         "ols6"         
## [117] "ols6a"         "ols6b"         "oprobit"       "oprobprobs"   
## [121] "out"           "p"             "p1"            "p2"           
## [125] "p2a"           "p2b"           "p3a"           "p3b"          
## [129] "p3b.fitted"    "part1"         "part2"         "plogclass"    
## [133] "plogprobs"     "pool.mice"     "ppcurve"       "pred1"        
## [137] "s"             "s.amelia"      "s.mi"          "s.mice"       
## [141] "s.orig"        "s.real"        "s1"            "s2"           
## [145] "s3"            "search"        "ses"           "ses.amelia"   
## [149] "sigma"         "slope"         "slopes"        "sm1"          
## [153] "sm2"           "smydf"         "sx"            "sy"           
## [157] "tmp1"          "tmp2"          "tmp3"          "tmp4"         
## [161] "tmpdata"       "tmpdf"         "tmpsplit"      "tmpx"         
## [165] "tmpz"          "tr"            "val"           "valcol"       
## [169] "w"             "weight"        "within"        "x"            
## [173] "X"             "x1"            "x1cut"         "x1t"          
## [177] "x2"            "X2"            "x2t"           "x3"           
## [181] "x4"            "x5"            "x6"            "xseq"         
## [185] "y"             "y1"            "y1s"           "y2"           
## [189] "y2s"           "y3"            "y3s"           "y4"           
## [193] "y5"            "y6"            "yt"            "z"            
## [197] "z1"            "z2"            "z5"            "z6"

This returns a character vector containing the names of all objects currently in our R environment. It is also possible to remove ALL variables in our current R session with the following:

# rm(list=ls())

Note: This is usually an option on the RGui dropdown menus and should only be done if you really want to remove everything. Sometimes you can also see an expression like:

b <- NULL

This expression does not remove the object, but instead makes its value NULL. NULL is different from missing (NA) because R (generally) ignores a NULL value whenever it sees it. You can see this in the difference between the following two vectors:

c(1, 2, NULL)
## [1] 1 2
c(1, 2, NA)
## [1]  1  2 NA

The first has two elements and the second has three. It is also possible to use the assign function to assign a value to a name:

assign("x", 3)
x
## [1] 3

This is not common in interactive use of R but can be helpful at more advanced levels.
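
The complement of assign is get, which retrieves an object by its (character string) name:

get("x")
## [1] 3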

Variable naming rules

R has some relatively simple rules governing how objects can be named: (1) R object names are case sensitive, so a is not the same as A. This applies to objects and functions. (2) R object names (generally) must start with a letter or a period. (3) R object names can contain letters, numbers, periods (.), and underscores (_). (4) The names of R objects can be just about any length, but anything over about 10 characters gets annoying to type. CAUTION: We can violate some of these restrictions by naming things with backticks, but this can be confusing:

`f` <- 2
f
## [1] 2
f <- 3
`f`
## [1] 3

That makes sense: a name written with backticks refers to the same object as the bare name. Backticks also allow us to create names that break the rules above, such as names that start with a number. To call objects with these noncompliant names, we then need to use the backticks:

`1f` <- 3
# Then try typing `1f` (with the backticks)

If we just called 1f, we would get an error. But this also means we can name objects with just a number as a name:

`4` <- 5
4
## [1] 4
# Then try typing `4` (with the backticks)

Which is kind of weird. It is best avoided.

Vector Indexing

An important aspect of working with R objects is knowing how to “index” them. Indexing means selecting a subset of elements in order to use them in further analysis or possibly change them. Here we focus on three kinds of vector indexing: positional, named, and logical. Any of these indexing techniques works the same way for all classes of vectors.

Positional indexing

If we start with a simple vector, we can extract each element from the vector by placing its position in brackets:

c("a", "b", "c")[1]
## [1] "a"
c("a", "b", "c")[2]
## [1] "b"
c("a", "b", "c")[3]
## [1] "c"

Indices in R start at 1 for the first item in the vector and continue up to the length of the vector. (Note: In some languages, indices start with the first item being indexed as 0.) This means that we can even index a one-element vector:

4[1]
## [1] 4

But, we will get a missing value if we try to index outside the length of a vector:

length(c(1:3))
## [1] 3
c(1:3)[9]
## [1] NA

Positional indices can also involve an R expression. For example, you may want to extract the last element of a vector of unknown length. To do that, you can embed the length function inside the [] brackets:

a <- 4:12
a[length(a)]
## [1] 12

Or, you can express any other R expression, for example to get the second-to-last element:

a[length(a) - 1]
## [1] 11

It is also possible to extract multiple elements from a vector, such as the first two elements:

a[1:2]
## [1] 4 5

You can use any vector of element positions:

a[c(1, 3, 5)]
## [1] 4 6 8

This means that you could also return the same element multiple times:

a[c(1, 1, 1, 2, 2, 1)]
## [1] 4 4 4 5 5 4

But note that positions outside the length of the vector will be returned as missing values:

a[c(5, 25, 26)]
## [1]  8 NA NA

It is also possible to index a vector less a set of specified elements, using the - symbol. For example, to get all elements except the first, one could simply index with -1:

a[-1]
## [1]  5  6  7  8  9 10 11 12

Or, to obtain all elements except the last element, we can combine - with length:

a[-length(a)]
## [1]  4  5  6  7  8  9 10 11

Or, to obtain all elements except the second and third:

a[-c(2, 3)]
## [1]  4  7  8  9 10 11 12

Note: While in general 2:3 is the same as c(2, 3), be careful when dropping elements: -2:3 means the sequence from -2 through 3, not -c(2, 3), so negative sequences need to be wrapped in parentheses, as in the sketch below.
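
A small sketch of that distinction, using the a vector from above:

a[-(2:3)]  # same as a[-c(2, 3)]
## [1]  4  7  8  9 10 11 12
# a[-2:3] would fail: -2:3 is c(-2, -1, 0, 1, 2, 3), which mixes
# negative and positive subscripts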

Named indexing

A second approach to indexing, which is not particularly common for vectors, is named indexing. Vector elements can be assigned names, such that each element has both a value and a name attached to it:

b <- c(x = 1, y = 2, z = "4")
b
##   x   y   z 
## "1" "2" "4"

This is the same as creating the vector first and then attaching the names:

b <- c(1, 2, "4")
names(b) <- c("x", "y", "z")
b
##   x   y   z 
## "1" "2" "4"

In this type of vector we can still use positional indexing:

b[1]
##   x 
## "1"

But we can also index based on the names of the vector elements:

b["x"]
##   x 
## "1"

And, just as with positional indexing, we can extract multiple elements at once:

b[c("x", "z")]
##   x   z 
## "1" "4"

But it's not possible to use the - indexing that we used with element positions; for example, b[-'x'] would return an error. If a vector has names, this provides a way to extract elements without knowing their relative position in the vector. If we want to know which name is in which position, we can also extract just the names of the vector elements:

names(b)
## [1] "x" "y" "z"

And we can use positional indexing on the names(b) vector, e.g. to get the first element's name:

names(b)[1]
## [1] "x"

Logical indexing

The final way to index a vector involves logicals. Positional indexing allowed us to use any R expression to extract one or more elements. Logical indexing allows us to extract elements that meet specified criteria, as specified by an R logical expression. Thus, with a given vector, we could, for example, extract elements that are equal to a particular value:

c <- 10:3
c[c == 5]
## [1] 5

This works by first constructing a logical vector and then using that to return elements where the logical is TRUE:

c == 5
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
c[c == 5]
## [1] 5

We can use an exclamation point (!) to negate the logical and thus return the opposite set of vector elements. This is similar to the - indexing used in positional indexing:

!c == 5
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
c[!c == 5]
## [1] 10  9  8  7  6  4  3

We do not need to restrict ourselves to logical equivalences. We can also use other comparators:

c[c > 5]
## [1] 10  9  8  7  6
c[c <= 7]
## [1] 7 6 5 4 3

We can also use boolean operators (i.e., AND &, OR |) to combine multiple criteria:

c < 9 & c > 4
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
c[c < 9 & c > 4]
## [1] 8 7 6 5
c > 8 | c == 3
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE
c[c > 8 | c == 3]
## [1] 10  9  3

Here we can see how different logical criteria translate into a logical vector that is then used to index our target vector. Some potentially unexpected behavior can happen if we index with a logical vector of a different length than our target vector:

c[TRUE]  # returns all elements
## [1] 10  9  8  7  6  5  4  3
c[c(TRUE, TRUE)]  # the index is recycled, so this also returns all elements
## [1] 10  9  8  7  6  5  4  3
c[FALSE]  # returns an empty vector
## integer(0)

Just as with positional indexing, if the logical vector is longer than our target vector, missing values will be appended to the end:

d <- 1:3
d[c(TRUE, TRUE, TRUE, TRUE)]
## [1]  1  2  3 NA

Because 0 and 1 values can be coerced to logicals, we can also use some shorthand to get the same indices as logical values:

as.logical(c(1, 1, 0))
## [1]  TRUE  TRUE FALSE
d[c(TRUE, TRUE, FALSE)]
## [1] 1 2
d[as.logical(c(1, 1, 0))]
## [1] 1 2
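
Relatedly, the which function converts a logical vector into the positions at which it is TRUE, which can then be used like any other positional index:

which(c > 8 | c == 3)
## [1] 1 2 8
c[which(c > 8 | c == 3)]  # equivalent to c[c > 8 | c == 3]
## [1] 10  9  3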

Blank index

Note: A blank index like e[] is treated specially in R. It refers to all elements in a vector.

e <- 1:10
e[]
##  [1]  1  2  3  4  5  6  7  8  9 10

This is of course redundant with simply typing e, but it can produce unexpected results during assignment:

e[] <- 0
e
##  [1] 0 0 0 0 0 0 0 0 0 0

This replaces all values of e with 0, which may or may not be intended.

Vectors

An important, if not the most important, object in the R language is the vector. A vector is a set of items joined together into a single object. Building a vector is easy using the c (“combine”) function:

c(1, 2, 3)
## [1] 1 2 3

This combines three items (1, 2, and 3) into a vector. The same result is possible with the : (colon) operator:

1:3
## [1] 1 2 3

The two can also be combined:

c(1:3, 4)
## [1] 1 2 3 4
c(1:2, 4:5, 6)
## [1] 1 2 4 5 6
1:4
## [1] 1 2 3 4

And colon-built sequences can run in either direction:

4:1
## [1] 4 3 2 1
10:2
## [1] 10  9  8  7  6  5  4  3  2

And we can also reverse the order of a vector using rev:

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
rev(1:10)
##  [1] 10  9  8  7  6  5  4  3  2  1

Arbitrary numeric sequences can also be built with seq:

seq(from = 1, to = 10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(2, 25)
##  [1]  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
## [24] 25

seq accepts a number of optional arguments, including by, which controls the spacing between vector elements:

seq(1, 10, by = 2)
## [1] 1 3 5 7 9
seq(0, 1, by = 0.1)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

and length.out, which controls the length of the resulting sequence:

seq(0, 1, length.out = 11)
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

A related function seq_along produces a sequence the length of another vector:

seq_along(c(1, 4, 5))
## [1] 1 2 3

This is shorthand for combining seq with the length function:

length(c(1, 4, 5))
## [1] 3
seq(1, length(c(1, 4, 5)))
## [1] 1 2 3

It's also possible to create repeated sequences using rep:

rep(1, times = 5)
## [1] 1 1 1 1 1

This also allows us to repeat shorter vectors into longer vectors:

rep(c(1, 2), times = 4)
## [1] 1 2 1 2 1 2 1 2

If we use an each parameter instead of a times parameter, we can get a different result:

rep(c(1, 2), each = 4)
## [1] 1 1 1 1 2 2 2 2

Finally, we might want to repeat a vector into a vector that is not a multiple of the original vector length. For example, we might want to alternate 1 and 2 for five values. We can use the length.out parameter:

rep(c(1, 2), length.out = 5)
## [1] 1 2 1 2 1

These repetitions can be helpful when we need to categorize data into groups.
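
For example, a minimal sketch of building alternating group labels for six hypothetical observations:

groups <- rep(c("control", "treatment"), length.out = 6)
groups  # alternates "control" and "treatment" across the six observations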

Vector classes

The above vectors are numeric, but vectors can be other classes, like character:

c("a", "b")
## [1] "a" "b"

Sequences of dates are also possible, using Date classes:

seq(as.Date("1999/1/1"), as.Date("1999/3/5"), "week")
##  [1] "1999-01-01" "1999-01-08" "1999-01-15" "1999-01-22" "1999-01-29"
##  [6] "1999-02-05" "1999-02-12" "1999-02-19" "1999-02-26" "1999-03-05"
seq(as.Date("1999/1/1"), as.Date("1999/3/5"), "day")
##  [1] "1999-01-01" "1999-01-02" "1999-01-03" "1999-01-04" "1999-01-05"
##  [6] "1999-01-06" "1999-01-07" "1999-01-08" "1999-01-09" "1999-01-10"
## [11] "1999-01-11" "1999-01-12" "1999-01-13" "1999-01-14" "1999-01-15"
## [16] "1999-01-16" "1999-01-17" "1999-01-18" "1999-01-19" "1999-01-20"
## [21] "1999-01-21" "1999-01-22" "1999-01-23" "1999-01-24" "1999-01-25"
## [26] "1999-01-26" "1999-01-27" "1999-01-28" "1999-01-29" "1999-01-30"
## [31] "1999-01-31" "1999-02-01" "1999-02-02" "1999-02-03" "1999-02-04"
## [36] "1999-02-05" "1999-02-06" "1999-02-07" "1999-02-08" "1999-02-09"
## [41] "1999-02-10" "1999-02-11" "1999-02-12" "1999-02-13" "1999-02-14"
## [46] "1999-02-15" "1999-02-16" "1999-02-17" "1999-02-18" "1999-02-19"
## [51] "1999-02-20" "1999-02-21" "1999-02-22" "1999-02-23" "1999-02-24"
## [56] "1999-02-25" "1999-02-26" "1999-02-27" "1999-02-28" "1999-03-01"
## [61] "1999-03-02" "1999-03-03" "1999-03-04" "1999-03-05"

But vectors can only have one class, so elements will be coerced, such that:

c(1, 2, "c")
## [1] "1" "2" "c"

produces a character vector.
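
We can confirm the coercion by checking the class of the result:

class(c(1, 2, "c"))
## [1] "character"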

Empty vectors

We can create vectors of different classes using the appropriate functions: (1) The function numeric produces numeric vectors:

numeric()
## numeric(0)

The result is an empty numeric vector. If we supply a length parameter:

numeric(length = 10)
##  [1] 0 0 0 0 0 0 0 0 0 0

The result is a vector of zeroes. (2) The function character produces an empty character vector:

character()
## character(0)

We can again supply a length argument to produce a vector of empty character strings:

character(length = 10)
##  [1] "" "" "" "" "" "" "" "" "" ""

(3) The function logical produces an empty logical vector:

logical()
## logical(0)

Or, with a length parameter, a vector of FALSE values:

logical(length = 10)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

These functions may seem pointless right now, but they are useful in larger projects: filling in the values of a vector that has been “initialized” in advance (e.g., with numeric, character, or logical) is much faster than growing a vector with c(). This is hard to observe at this scale (a few elements) but matters with bigger data.
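
A small sketch of the “initialize, then fill” pattern, using an illustrative vector out:

out <- numeric(5)         # preallocate the full-length vector
for (i in 1:5) {
    out[i] <- i^2         # fill in each element in place
}
out
## [1]  1  4  9 16 25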